Stationary Points of a Shallow Neural Network with Quadratic Activations and the Global Optimality of the Gradient Descent Algorithm

Published Online:https://doi.org/10.1287/moor.2021.0082

Abstract

We consider the problem of training a shallow neural network with quadratic activation functions and the generalization power of such trained networks. Assuming that the samples are generated by a full rank matrix W* of the hidden network node weights, we obtain the following results. We establish that all full-rank approximately stationary solutions of the risk minimization problem are also approximate global optimums of the risk (in-sample and population). As a consequence, we establish that, when trained on polynomially many samples, the gradient descent algorithm converges to the global optimum of the risk minimization problem regardless of the width of the network when it is initialized at some value ν*, which we compute. Furthermore, the network produced by the gradient descent has a near zero generalization error. Next, we establish that initializing the gradient descent algorithm below ν* is easily achieved when the weights of the ground truth matrix W* are randomly generated and the matrix is sufficiently overparameterized. Finally, we identify a simple necessary and sufficient geometric condition on the size of the training set under which any global minimizer of the empirical risk has necessarily zero generalization error.

Funding: The research of E. C. Kizildag is supported by Columbia University, with the Distinguished Postdoctoral Fellowship in Statistics. Support from the National Science Foundation [Grant DMS-2015517] is gratefully acknowledged.

1. Introduction

Neural network architectures are demonstrated to be extremely powerful in practical tasks, such as natural language processing (Collobert and Weston [20]), image recognition (He et al. [46]), image classification (Krizhevsky et al. [53]), speech recognition (Mohamed et al. [63]), and game playing (Silver et al. [80]), and is becoming popular in other areas, such as applied mathematics (Chen et al. [17], Weinan et al. [94]), clinical diagnosis (De Fauw et al. [21]), and so on. Despite this empirical success, a rigorous mathematical understanding of these architectures is still an ongoing quest.

Despite the fact that it is NP-hard to train such architectures in the worst case setting, it has been observed empirically that the gradient descent (GD), albeit a simple, first order, local procedure, is rather successful in training such networks. This is somewhat surprising because of the highly nonconvex nature of the associated objective function. Our main motivation in this paper is to provide further insights into the optimization landscape and generalization abilities of these networks.

1.1. Model, Contributions, and Comparison with Prior Work

In this section, we introduce the model considered in this paper, describe our contributions, and discuss the relevant literature.

1.1.1. Model.

In this paper, we consider a shallow neural network architecture with one hidden layer of width m. Namely, the network consists of m neurons. We study it under the realizable model assumption, that is, the labels are generated by a teacher network with ground truth weight matrix W*Rm×d whose jth row Wj*Rd carries the weights of the jth neuron and md. We assume that the input data XRd consists of independent and identically distributed (i.i.d.) centered sub-Gaussian coordinates. It is worth noting that such shallow architectures with planted weights and Gaussian input data are explored extensively in the literature (see, e.g., Brutzkus and Globerson [11], Du et al. [28], Li and Yuan [56], Soltanolkotabi [82], Tian [89], Zhong et al. [99]).

Our focus is, in particular, on networks with quadratic activation, studied also by Soltanolkotabi et al. [83] and Du and Lee [24], among others. This object, an instance of what is known as a polynomial network (Livni et al. [59]), computes for every input data XRd the function

f(W*;X)=j=1mWj*,X2=W*X22.(1)

We note that, albeit a stylized activation function, blocks of quadratic activations can be stacked together to approximate deeper networks with sigmoid activations as shown by Livni et al. [59], and furthermore, this activation serves as a second order approximation of general nonlinear activations as noted by Venturi et al. [90]. Thus, we study the quadratic networks as an attempt to gain further insights on more complex networks.

Let XiRd,1iN be an i.i.d. collection of input data, and let Yi=f(W*;Xi) be the corresponding label generated per (1). The goal of the learner is as follows: given the training data (Xi,Yi)Rd×R,1iN, find a weight matrix WRm×d that explains the input–output relationship on the training data set in the best possible way, often by solving the so-called empirical risk minimization (ERM) optimization problem

minWRm×d L^(W)whereL^(W)1N1iN(Yif(W;Xi))2;(2)
understand its generalization ability, quantified by the generalization error (also known as the population risk associated with any solution candidate WRm×d) that is given by
L(W)E[(f(W*;X)f(W;X))2],(3)
where the expectation is with respect to a fresh sample X, which has the same distribution as Xi,1iN but is independent from the sample. The landscape of the loss function L^(·) is nonconvex, therefore rendering the optimization problem (potentially) difficult. Nevertheless, the gradient descent algorithm, despite being a simple first order procedure, is rather successful in training neural networks in general: it appears to find a WRm×d with near-optimal L^(W). Our partial motivation is to investigate this phenomenon in the case in which the activation function is quadratic.

1.1.2. Contributions.

We first study the landscape of risk functions and quantify an energy barrier separating rank-deficient matrices from the full-rank planted weights. Specifically, if W*Rm×d is full-rank, namely, rank d (recall md), then the risk function for any rank-deficient W is bounded away from zero by an explicit constant—independent of d—controlled by the smallest singular value σmin(W*) of W* as well as the second and fourth moments of the data. See Theorem 1 for the population and Theorem 2 for the empirical versions of this result. (Theorem 2 holds with high probability (w.h.p.) with respect to the observed sample.)

Next, we study the full-rank stationary points of the risk functions and the performance of the gradient descent algorithm. We first establish that, when W* is full rank, any full-rank stationary point W of the risk functions is necessarily a global minimum and any such W is of form W=QW*, where QRm×m is orthonormal. See Theorem 3 for the population and Theorem 4 for the empirical versions. Namely, W is a global optimum up to a rotation. We then establish that all approximate stationary points (appropriately defined) W of L^(·) below the aforementioned energy barrier are nearly global optimum. Furthermore, we establish that, if the number N of samples is poly(d), then the weights W of any full-rank approximate stationary point are uniformly close to W*. As a corollary, gradient descent with initialization below the energy barrier recovers in time poly(ϵ1,d) a solution W for which the weights are ϵ-close to the planted weights. Consequently, the generalization error L(W) for this solution W is at most ϵ. The bound on L(W) is derived by controlling the condition number of a certain matrix whose i.i.d. rows consists of tensorized data Xi2, using a recently developed machinery in Emschwiller et al. [33] studying the spectrum of expected covariance matrices of tensorized data. See Theorem 5 for the population and Theorem 6 for the empirical version.

Subsequently, we study the question of whether one can find the initialization of the gradient descent algorithm below the aforementioned energy barrier. We answer affirmatively this question in the context of randomly generated W*Rm×d and establish in Theorem 8 that, as long as the network is sufficiently overparameterized, specifically m>Cd2, for some sufficiently large constant C, it is possible to initialize W0 such that w.h.p. the risk associated to W0 is below the required threshold. This is achieved by using tools from random matrix theory, specifically a semicircle law for Wishart matrices that shows the spectrum of (W*)TW* is tightly concentrated (Bai and Yin [3]). See Theorem 7 for the population and Theorem 8 for the empirical version. It is worth noting that neural networks with random weights is an active area of research by itself because of the relationship with random feature methods. For example, Rahimi and Recht [73] show that shallow architectures trained by choosing the internal weights randomly and optimizing only over the output weights return a classifier with reasonable generalization performance at accelerated training speed. Random shallow networks are also shown to well-approximate dynamical systems (Gonon et al. [40]), have been successfully employed in the context of extreme learning machines (Huang et al. [48]), and are studied in the context of random matrix theory; see Pennington and Worah [69] and references therein.

Our next focus is on the sample complexity for generalization. Even though we study the landscape of the empirical risk, it is not by any means certain that any (potentially not full-rank) optimizer of minW L^(W) also achieves zero generalization error. We give necessary and sufficient conditions on the samples Xi,1iN so that any minimizer has indeed zero generalization error in our setting. We show that, if span(XiXiT:1iN) is the space of all d × d-dimensional real symmetric matrices, then any global minimum of the empirical risk is necessarily a global optimizer of the population risk and, thus, has zero generalization error. Note that this geometric condition is not retrospective in manner: it can be checked ahead of the optimization procedure by computing span(XiXiT:1iN). Conversely, we show that, if the preceding span condition is not met, then there exists a global minimum W of the empirical risk function that induces a strictly positive generalization error. This is established in Theorem 9.

To complement our analysis, we then ask the following question: what is the critical number N* of the training samples under which the (random) data Xi,1iN enjoys the aforementioned span condition? We prove this number to be N*=d(d+1)/2 under a very mild assumption that the coordinates of XiRd are jointly continuous. This is shown in Theorem 10. Finally in Theorem 11, we show that, when N<N*, not only does there exist W with zero empirical risk and strictly positive generalization error, we also bound this error from below by an amount very similar to the bound for rank-deficient matrices discussed in our earlier Theorem 2.

We end with a comment on overparameterization and generalization. A common paradigm in statistical learning theory is that overparameterized models, that is, models with more parameters than necessary, although capable of interpolating the training data, tend to generalize poorly because of overfitting to the proposed model. Yet it is observed empirically that neural networks tend not to suffer from this complication (Zhang et al. [96]): despite being overparameterized, they seem to have a good generalization performance provided the interpolation barrier is exceeded. In Theorem 9(a), we establish the following result, which sheds some light on this phenomenon for the case of shallow neural networks with quadratic activations: suppose that the data enjoys the aforementioned geometric condition. Then, any interpolator achieves zero generalization error even when the interpolator is a neural network with a potentially larger number m^ of internal nodes compared with the one that generated the data, namely, by using a weight matrix WRm^×d, where m^m. In other words, the model does not overfit when a much larger width of the interpolator is chosen at the learning state.

1.1.3. Comparison with Soltanolkotabi et al. [83] and Du and Lee [24].

We now make a comparison with two very related prior works, also studying quadratic activations. We start with the work by Soltanolkotabi et al. [83]. In Soltanolkotabi et al. [83, theorem 2.2], the authors study the empirical risk landscape of a slightly more general version of our model: Yi=j=1mvj*Wj*,Xi2, assuming rank(W*)=d like us and assuming all nonzero entries of v* have the same sign. Thus, our model is the special case in which all entries of v* are equal to unity. The authors establish that, as long as dNcd2 for some small fixed constant c, every local minima of the empirical risk function is also a global minima (namely, there exists no spurious local minima), and furthermore, every saddle point has a direction of negative curvature. As a result, they show that gradient descent with an arbitrary initialization converges to a globally optimum solution of the ERM problem (2). In particular, their result does not require the initialization point to be below some risk value (the energy barrier) as in our case. Nevertheless, our results show that one needs not to worry about saddle points below the energy barrier as none exists per our Theorem 2. Importantly, though, the regime N<cd2 for small c that Soltanolkotabi et al. [83, theorem 2.2] applies is below the provable sample complexity value N*=d(d+1)/2 when the data are drawn from a continuous distribution as per our Theorem 10. In particular, as we establish when N<N*, the ERM problem (2) admits global optimum solutions with zero empirical risk value but with the generalization error bounded away from zero. Thus, the regime N<N* does not correspond to the regime in which solving the ERM has a guaranteed control on the generalization error. The same theorem in Soltanolkotabi et al. [83] also studies the approximate stationary points and shows that, for any such point W, the associated empirical risk, L^(W), is also small. Our Theorem 6, though, takes a step further and shows that not only is the empirical risk small, but the recovered W is close to planted weights W*, and therefore, it has a small generalization error L(W) by explicitly bounding the generalization error from above.

It is also worth noting that, although not our focus in the present paper, Soltanolkotabi et al. [83, theorem 2.1] also studies the landscape of the empirical risk when a quadratic network model Xj=1mvj*Wj*,X2 is used for interpolating arbitrary input/label pairs (Xi,Yi)Rd×R,1iN, that is, without making an assumption that the labels are generated according to a network with planted weights. They establish similar landscape results, namely, the absence of spurious local minima and the fact that every saddle point has a direction of negative curvature as long as the output weight v* has at least d positive and at least d negative entries (consequently, the width m has to be at least 2d). Even though this result does not assume any rank condition on W, the assumption on the minimum number of positive and minimum number of negative entries, such as the preceding one, is somewhat unnatural.

Yet another closely related work studying quadratic activations is the paper by Du and Lee [24], which focuses on the shallow architectures with all unity output weights as we do. This paper establishes that, for any smooth and convex loss (·,·), the landscape of the regularized loss function 1Ni=1N(f(W;Xi),Yi)+λ2WF2 still admits aforementioned favorable geometric characteristics. Furthermore, as the learned weights are of bounded Frobenius norm thanks to the norm penalty WF2 imposed on the objective, they retain good generalization via Rademacher complexity considerations. Even though this work addresses the training and generalization error when the norm of W is controlled during training, it does not carry out approximate stationarity analysis as Soltanolkotabi et al. [83] and we do and does not study their associated loss/generalization as in our case. Even though they show that optimal solutions to the optimization problem incorporating bounded norms generalize well, it remains unclear from their analysis whether the approximate stationary points of this objective also have a well-controlled norm.

It is worth mentioning that the two main directions that we undertake in this paper were not explored in either of these two prior works. These include the direction pertaining to the initialization (Theorems 7 and 8) and the direction pertaining to the sample complexity (Theorems 911). The latter direction relates to an interesting interpolation/overparameterization property that we have discussed before. We return later to this direction in Section 2.3 after we present Theorem 9.

1.1.4. Connection to Matrix Sensing.

The problem of learning shallow quadratic networks with planted weights W* is closely related to the matrix sensing problem. This problem has many applications, spanning from video and image processing to control and sensor networks (see Deng et al. [23], Qin et al. [72], Zhong et al. [97] and references therein); we now elaborate on the connection between our setting and the matrix sensing problem. Given an unknown matrix A*Rd1×d2, the goal of the matrix sensing problem is to recover A* from a (small) number of linear measurements of form Zi,A*,1iN, where ZiRd1×d2,1iN, and Zi,A*=trace(ZiTA*). Recalling (1), our data model corresponds to

Yi=W*Xi22=XiT(W*)TW*Xi=(W*)TW*,XiXiT.

Namely, the problem that we study in the present paper is an instance of the matrix sensing problem; the goal is to recover A*=(W*)TW* from rank-1 measurements of form XiXiT, where XiRd have centered i.i.d. sub-Gaussian coordinates. A related prior work by Zhong et al. [97] studies the problem of sensing a low-rank matrix from rank-1 measurements of form XiYiT, where Xi,YiRd are i.i.d. and they consist of centered i.i.d. sub-Gaussian coordinates (see also Deng et al. [23] for an improvement of Zhong et al. [97] in terms of the time and sample complexity). Notice that Zhong et al. [97] studies a slightly different measurement matrix of form XiYiT (as opposed to XiXiT). More importantly Zhong et al. [97] are concerned with recovering a low-rank A*, whereas the fact that (W*)TW* is full-rank is crucial for our techniques; see subsequent details. A very recent work by Qin et al. [72] relaxes the low-rank assumption and studies measurement matrices of form uiuiT for uiRd that are nearly orthogonal and give recovery guarantees for the stochastic gradient descent. Note that, for XiRd as before, standard concentration arguments show that the collection Xi, 1iN, consists (w.h.p.) of nearly orthogonal vectors. Therefore, the setting studied in Qin et al. [72] is very similar to ours, and their sample complexity guarantee regarding the gradient descent appears to have a better dependence on dimension than ours. On the other hand Qin et al. [72] focus only on the problem of finding an A^ such that

1iN(A^,XiXiTA*,XiXiT)2
is small; it is not clear from their analysis whether A^ is also close to ground truth A*, that is, whether A^A* is small. This is one of the main directions we explore in the present paper. For more on the matrix sensing problem, see Recht et al. [74], Zhong et al. [97], Deng et al. [23], Qin et al. [72], and the references therein.

1.1.5. Further Relevant Prior Work.

As noted in the introduction, neural networks achieved remarkable empirical success, which fueled research starting from the expressive ability of these networks, going as early as Barron [5]. More recent works along this front focus on deeper and sparser models; see, for example, Mhaskar et al. [62], Telgarsky [88], Eldan and Shamir [31], Schmidt-Hieber [78], Poggio et al. [70], and Bölcskei et al. [10]. In particular, the expressive power of such network architectures is relatively well understood. Another issue pertaining to such architectures is their computational tractability: Blum and Rivest [9] establish that it is NP-complete to train a very simple three-node network whose nodes compute a linear thresholding function. Despite this worst case result, it is observed empirically that local search algorithms, such as GD, are rather successful in training. Even though several authors, including Sedghi and Anandkumar [79], Janzamin et al. [49], and Goel et al. [38], devise provable training algorithms for such networks, these algorithms unfortunately are based on methods other than the gradient descent, thus not shedding any light on its apparent empirical success.

On a parallel front, many papers study the behavior of the GD by analyzing the trajectory of it or its stochastic variant (SGD) under certain stylistic assumptions on the data as well as the network. These assumptions include Gaussian inputs, shallow networks (with or without the convolutional structure), and the existence of planted weights (the so-called teacher network) generating the labels. Some partial and certainly very incomplete references to this end include Tian [89], Brutzkus and Globerson [11], Brutzkus et al. [12], Zhong et al. [98], Soltanolkotabi [82], Li and Yuan [56], and Du et al. [28]. Later work relaxes the distributional assumptions. For instance, Du et al. [25] study the problem of learning a convolutional unit with ReLU with no specific distributional assumption on input and establish the convergence of SGD with rate depending on the smoothness of the input distribution and the closeness of the patches. Several other works along this line, in particular under the presence of overparameterization, are the works by Du et al. [26, 27].

Yet another line of research on the optimization front, rather than analyzing the trajectory of the GD, focuses on the mean-field analysis. Empirical distribution of the parameters of network with infinitely many internal nodes can be described as a Wasserstein gradient flow, and thus, some tools from the theory of optimal transport can be used; see, for example, Wei et al. [93], Rotskoff and Vanden-Eijnden [75], Chizat and Bach [18], Song et al. [84], and Sirignano and Spiliopoulos [81]. Albeit explaining the story to some extent for infinitely wide networks, it remains unclear whether these techniques provide results for a more realistic network model with finitely many internal nodes.

As noted earlier, the optimization landscape of such networks is usually highly nonconvex. More recent research on such nonconvex objectives shows that, if the landscape has certain favorable geometric properties, such as the absence of spurious local minima and the existence of direction with negative curvature for every saddle point, local methods can escape the saddle points and converge to the global minima. Examples of this line of research on loss functions include Ge et al. [37], Levy [55], Lee et al. [54], Jin et al. [50], and Du et al. [29]. Motivated by this front of research, many papers analyze geometric properties of the optimization landscape, including Poston et al. [71], Haeffele et al. [43], Choromanska et al. [19], Haeffele and Vidal [42], Kawaguchi [51], Hardt and Ma [44], Soudry and Carmon [85], Freeman and Bruna [34], Zhou and Feng [100], Nguyen and Hein [67, 68], Ge et al. [36], Safran and Shamir [77], Soudry and Hoffer [86], Zhou and Liang [101], Venturi et al. [90], Du et al. [24], and Soltanolkotabi et al. [83].

We now touch upon yet another very important focus, that is, the generalization ability of such networks: how well a solution found, for example, by GD, predicts unseen data? A common paradigm in statistical learning theory that was mentioned previously is that overparameterized models tend to generalize poorly. Yet neural networks tend to not suffer from this complication (Zhang et al. [96]). Because the Vapnik–Chervonenkis dimension of these networks grows (at least) linear in the number of parameters (Bartlett et al. [7], Harvey et al. [45]), standard Vapnik–Chervonenkis theory does not help explaining the good generalization ability under presence of overparameterization. This is studied, among others, through the lens of the norm of weight matrices (Bartlett et al. [6], Dziugaite and Roy [30], Golowich et al. [39], Liang et al. [58], Neyshabur et al. [65], Wu et al. [95]), PAC-Bayes theory (Neyshabur et al. [64, 66]), and compression-based bounds (Arora et al. [1]). A main drawback is that these papers require some sort of constraints on the weights and are mostly a posteriori: whether a good generalization takes place can be determined only when the training process is finished. A recent work by Arora et al. [2] provides an a priori guarantee for the solution found by the GD. Our result regarding the generalization guarantee described in Theorem 11 also provides simple a priori guarantee on the generalization.

1.1.6. A Follow-up Work.

After our paper appeared on arXiv, a follow-up work was done by Mannelli et al. [60]. In this paper, the authors consider the same architecture (namely, a shallow network with quadratic activations) under the so-called teacher/student setting and study the landscape of the empirical risk as well as the performance of the gradient flow, the continuous-time analogue of the gradient descent. Importantly, they consider also the regime in which the number m* of the hidden units of the teacher network is less than the dimension d (whereas our focus is on m*d). In particular, when m*=1, the width m of the student network is at least d, and the data consists of i.i.d. standard normal entries; they prove the following. In the limit as d, if n>2d, then with positive probability the only minimizer of the empirical risk is the matrix A* of teacher weights itself, whereas for n<2d, the empirical risk admits spurious minima with probability tending to one. (Namely, the geometry of the empirical risk undergoes a phase transition as αn/d crosses αc=2). Moreover, they also prove that, for md, the gradient flow converges to a global minima of the empirical risk and to the global minimum of the population risk (which is A*) and characterize the rate of convergence for the latter case. (It is worth noting that running gradient flow on the population risk can be perceived as running it on the empirical risk in the limit of the large number n of samples.)

1.1.7. Paper Organization.

In Section 2.1, we present our main results on the landscape of the risk functions, including our energy barrier result for rank-deficient matrices, our result about the absence of full-rank stationary points of the risk function except the globally optimum points, and our result on the convergence of gradient descent. In Section 2.2, we present our results regarding randomly generated weight matrices W* and sufficient conditions for good initializations. In Section 2.3, we study the critical number of training samples guaranteeing good generalization property. We collect useful auxiliary lemmas in Section 3 and provide the proofs of all of our results in Section 4.

1.1.8. Notation.

The set of reals, positive reals, and the set {1,2,,k} are denoted, respectively, by R,R+, and [k]. For any matrix A, its smallest and largest singular values, spectrum, trace, Frobenius, and spectral norm are denoted, respectively, by σmin(A),σmax(A),σ(A),trace(A),AF, and A2. We denote the n × n identity matrix by In. Planted weights are denoted with an asterisk, for example, W*. exp(α) denotes eα. Given any vRn,v2 denotes its Euclidean 2 norm 1invi2. Given two vectors x,yRn, their Euclidean inner product 1inxiyi is denoted by x,y. Given a collection Z1,,Zk of objects of the same kind (e.g., vectors or matrices), span(Zi:i[k]) is the set, {j=1kαjZj:αjR}. We say a random variable X is centered if E[X]=0. Θ(·),O(·),o(·), and Ω(·) are standard (asymptotic) order notations for comparing the growth of two sequences. L^(·),L^(·),L, and L denote, respectively, the empirical risk, its gradient, and the population risk and its gradient.

2. Main Results

Our main results are now in order.

2.1. Optimization Landscape

2.1.1. Existence of an Energy Barrier.

Our first result shows the presence of an energy barrier in the landscape of the population risk L(·) below which any rank-deficient WRm×d ceases to exist.

Theorem 1.

Suppose that XRd has i.i.d. centered coordinates with variance μ2, (finite) fourth moment μ4, rank(W*)=d, and let L(W) be defined as (3).

  1. (Lower bound) It holds that

    minWRm×d:rank(W)<d L(W)min{μ4μ22,2μ22}·σmin(W*)4.

  2. (Tightness) There exists a matrix WRm×d such that rank(W)d1 and

    L(W)max{μ4,3μ22}·σmin(W*)4.

The proof of Theorem 1 is deferred to Section 4.2. Two remarks are in order.

First, the hypothesis of Theorem 1 holds under mild distributional assumptions on the coordinates of data: a finite fourth moment and zero mean suffices.

Second, part (b) of Theorem 1 implies that our lower bound on the energy value is tight up to a multiplicative constant determined by the moments of the data. That is, there exists a W with rank(W)d1 such that L(W)=Θ(σmin(W*)4), where the asymptotic Θ(·) hides the constants μ2 and μ4.

Our next result is an analogue of Theorem 1 for the empirical risk L^(·), and it establishes the presence of a similar energy barrier in the landscape of the empirical risk L^(·) below which any rank-deficient WRm×d ceases to exist with high probability.

Theorem 2.

Let K > 0 be an arbitrary constant and XiRd,1iN be a collection of i.i.d. random vectors each having centered i.i.d. sub-Gaussian coordinates. That is, for some C > 0, P(|X(j)|>t)exp(Ct2) for every t0,j[d]. Suppose, furthermore, that, for every M > 0, the distribution of X(j), conditional on |X(j)|M, is centered: E[X(j)||X(j)|M]=0. Let Yi=f(W*;Xi),1iN be the corresponding label generated by a planted teacher network per (1), where rank(W*)=d and W*FdK. Then, for some absolute constants C3,C>0 with probability at least

1exp(Cd)(9d4K+9)d21·exp(C3Nd44K)NdeCd,
it holds that
minWRm×d:rank(W)d1 L^(W)minWRm×d:rank(W)d11N1iN(Yif(W;Xi))212C5σmin(W*)4.

Here,

C5=min{μ4(1/2)μ2(1/2)2,2μ2(1/2)2},whereμt(1/2)=E[X(j)t||X(j)|d1/2].

Furthermore, if the dimension d of data is constant (d=O(1)), then with probability 1O(1/N),

minWRm×d:rank(W)d1 L^(W)12C5¯σmin(W*)4,
where
C5¯=min{μ4μ22,2μ22}andμn=E[X(j)n].

The proof of Theorem 2 is provided in Section 4.9. Several remarks are now in order.

Assuming d is large, Theorem 2 shows that, with high probability, L^(W) is bounded away from zero by an explicit constant for any W that is rank-deficient provided N=dO(1), where the O(1) term depends on K. Furthermore, provided N is a sufficiently large polynomial-in-d quantity, the probability estimate is of form 1exp(Θ(d)), which is exponential in the dimension.

Note that one indeed needs a finite d correction for the case when the data are low-dimensional, d=O(1): for d=O(1), the term Ndexp(Θ(d)) makes the probability estimate vacuous. The constant C5¯ appearing in this case is precisely the same constant appearing in Theorem 1, and in particular, no conditioning is required. Furthermore, even though we establish the probability estimate to be 1O(1/N) for simplicity, it can be improved: it appears, from our analysis (which uses Chebyshev’s inequality), that for any α>0, one can show the probability estimate to be 1O(Nα). Furthermore, using more elaborate tools (such as concentration for heavy-tailed variables, in particular, for i.i.d. averages of fourth moments of sub-Gaussian variables), this estimate can potentially be improved to 1exp(c¯0Nc0) for suitable constants c0,c¯0>0. Finally, our analysis yields also that the 1O(1/N) probability estimate still remains valid even when Xi has centered i.i.d coordinates with a finite eight moment. That is, the sub-Gaussianity assumption can be relaxed. Indeed, the expression L^(W) is an i.i.d. sum of the form N11iN(XiTMXi)2 for a suitable matrix M, and one needs the finiteness of E[(XTMX)4] to apply Chebyshev’s inequality: this quantity is finite provided E[Xi(j)8]< and MF=O(1). We do not pursue these extensions for keeping the presentation clear.

We now discuss the assumption that the coordinates of Xi are conditionally centered. This assumption is not inherent to the problem; it is adopted purely for technical reasons. Toward Theorem 2, one needs to establish the existence of an analogous energy barrier for a single rank-deficient ARd×d and should in particular study an event of form

EA={1N1iN(YiXiTAXi)212C5(K1)σmin(W*)4},
where C5(K1) is a certain constant; see Lemma 2. To control P[EA], we first condition on an event of the form {maxiXidK1} on which the coordinates of Xi are bounded, and we then apply Hoeffding’s inequality to obtain probabilistic guarantees that are exponential in the sample size N. Here, the fact that the coordinates of Xi are conditionally centered makes Theorem 1 applicable; one can then deduce that the conditional mean of (XTAXXT(W*)TW*X)2 is at least C5(K1)σmin(W*)4. Without this assumption, it appears difficult to study the aforementioned probability, and one needs more sophisticated tools. One potential approach is to apply Jensen’s inequality to control the difference between the output of the model with the planted model and then to apply Mendelson’s [61] small ball method; such an argument might, in fact, help relaxing distributional assumptions on Xi. (We thank the anonymous reviewer for suggesting this argument.) The assumption that the conditional mean of Xi(j) is zero is benign; it holds, for example, for random variables whose distributions are symmetric around zero. For instance, any normal distribution with zero-mean satisfies this assumption. Furthermore, Theorem 2 still remains valid even when the coordinates of the data have heavier tails: our techniques apply also to the tails of form P(|Xi(j)|>t)exp(Ctα), where C,α>0 are arbitrary. We use α = 2 throughout for simplicity.

We next comment on the dependence on W*F through the constant K. The (value of) energy barrier is independent of this gap, whereas the probability guarantee does depend on it. This is an artifact of our proof technique. Our proof is based on an ϵ-net argument for a set of rank-deficient matrices with bounded AF together with a union bound. The cardinality bound we employ on this net per Lemma 3 depends on dK; consequently, this constant appears in the probability guarantee. Once again, it might be possible to avoid this by using a different argument as outlined earlier.

An inspection of the proofs of Theorems 1 and 2 yield that the landscapes of the corresponding risks still admit an energy barrier even if we consider the same network architecture with planted weight matrix W*Rm×d and quadratic activation function having lower order terms, that is, the activation αx2+βx+γ with α0. In this case, in addition to σmin(W*) and the corresponding moments of the data, the coefficient α also appears in the barrier expression. In particular, Theorem 1 still remains valid with min{μ4μ22,2μ22}·σmin(W*)4 replaced with α2·min{μ4μ22,2μ22}·σmin(W*)4, and Theorem 2 still remains valid with 12C5σmin(W*)4 replaced with α22C5σmin(W*)4.

2.1.2. Global Optimality of Full-Rank Stationary Points.

We now establish that, if W is a full-rank stationary point of the population risk, L(·), then W is necessarily a global minimum.

Theorem 3.

Suppose W*Rm×d with rank(W*)=d. Suppose XRd has centered i.i.d. coordinates with E[Xi2]=μ2,E[Xi4]=μ4, and Var(Xi2)>0. Let WRm×d be a stationary point of the population risk with full-rank, that is, L(W)=E[(f(W*;X)f(W;X))2]=0, and rank(W)=d. Then, W=QW* for some orthogonal matrix Q and that L(W)=0.

The proof of Theorem 3 is deferred to Section 4.6.

Our next result is an analogue of Theorem 3 for the empirical risk, L^(·), and shows that, if Nd(d+1)/2 and W is any full-rank stationary point of the empirical risk, then W is necessarily a global minimum.

Theorem 4.

Let XiRd,1iN, W*Rm×d with rank(W*)=d, and suppose W is a full-rank stationary point of the empirical risk: rank(W)=d, and WL^(W)=0. Then, L^(W)=0. Furthermore, if Xi are i.i.d. random vectors having a jointly continuous distribution and Nd(d+1)/2, then with probability one, W=QW* for some orthogonal matrix QRm×m.

The proof of Theorem 4 is given in Section 4.10.

Note that an implication of Theorems 3 and 4 is that the corresponding losses admit no full-rank saddle points. Namely, the landscape of the corresponding losses has fairly benign properties below the aforementioned energy barrier. We soon show how this implies the convergence of gradient descent in the next section.

2.1.3. Convergence of Gradient Descent.

We now combine Theorems 1 and 3 to obtain the following conclusion on the performance of the gradient descent algorithm for the population risk. Suppose that the algorithm is initialized at a point W0 having a small population risk L(W0), in particular, lower than the smallest risk value achieved by the rank-deficient matrices. Then, with a properly chosen step size, the algorithm converges to a global optimum: it generates a trajectory {Wk}k0 of weights such that limk L(Wk)=minW L(W)=0.

Theorem 5.

Let W0Rm×d be a matrix of weights with the property that

L(W0)<minWRm×d:rank(W)<d L(W).

Define

L=sup{2L(W):L(W)L(W0)},
where by 2L(W), we denote the spectral norm of the matrix 2L(W). Then, L< and the gradient descent algorithm with initialization W0Rm×d and any fixed step size of 0<η<1/2L generates a trajectory {Wk}k0 of weights such that limk L(Wk)=minW L(W)=0.

The proof of Theorem 5 is provided in Section 4.7.

Our next focus is on the performance of the gradient descent algorithm for the empirical risk. By combining Theorems 2 and 4, we obtain the following conclusions. Suppose that the gradient descent algorithm is initialized at a point with a sufficiently small empirical risk, in particular, lower than the smallest risk value achieved by rank-deficient matrices (i.e., the energy barrier), and fix an ϵ>0. Then, with a properly chosen step size, it finds an approximate stationary point W (that is, a WRm×d with a small L^(W)F) in time poly(ϵ1,σmin(W*)1,d) for which the weights WTW are uniformly ϵ-close to the planted weights (W*)TW*, and consequently, the generalization error L(W) is at most ϵ. Furthermore, the algorithm converges to a global optimum of the empirical risk minimization problem minW L^(W), which is zero, thus recovering planted weights because of the absence of spurious stationary points within the set of full-rank matrices.

Theorem 6.

Let ϵ,K>0 be arbitrary. Suppose that XiRd,1iN satisfies the assumptions in Theorem 2; W0Rm×d is a matrix of weights with the property

L^(W0)<12C5σmin(W*)4,
where C5 is the constant defined in Theorem 2 and W*FdK. Define
L=L(W0)sup{2L^(W):L^(W)L^(W0)},
where by 2L^(W), we denote the spectral norm of the (Hessian) matrix 2L^(W) and
ζmin{ϵ·σmin(W*)232d4K+4,ϵ2·σmin(W*)2(C)2d4K+15,ϵ·σmin(W*)2(C)2μ22d4K+16},(4)
where C>0 is some absolute constant, and μ2=E[X(j)2]. Then, with probability at least
1exp(cN1/4)(9d4K+9)d21·exp(C4Nd44K)Ndexp(Cd),
(where c,C,C4>0 are also absolute constants),
  1. For any W with L^(W)<12C5σmin(W*)4, WFdK+1. Moreover, L=poly(d)<.

  2. Running gradient descent algorithm starting from W0 with a step size of 0<η<1/2L generates a full-rank WRm×d with L^(W)Fζ in time poly(ϵ1,σmin(W*)1,d). Furthermore, for this W, L^(W)ϵ.

  3. For W in (b), it holds that WTW(W*)TW*Fϵ, and the generalization error L(W) is at most ϵ provided Nd58/3.

Furthermore, suppose the data dimension d is constant, d=O(1), and W0Rm×d is a matrix of weights with the property L^(W0)<12C5¯σmin(W*)4 for the same constant C5¯ appearing in Theorem 2. Then, for ζ chosen per (4), with probability at least 1O(1/N),

  1. For any W with L^(W)12C5¯σmin(W*)4, it is the case that WF=O(1). Moreover, L=O(1)<.

  2. A running gradient descent algorithm starting from W0 with a step size of 0<η<1/2L generates a full-rank W with L^(W)ζ and L^(W)ϵ.

  3. For any W in (b), max{WTW(W*)TW*F,L(W)}ϵ.

The proof of Theorem 6 is provided in Section 4.11. Several remarks are now in order.

As a simple corollary to (b), we, thus, obtain that, for d large enough, the gradient descent algorithm with initialization W0Rm×d and a step size of 0<η<1/2L generates a trajectory {Wk}k0 of weights such that limk L^(Wk)=minW L^(W)=0. This yields an analogue of Theorem 5 for the empirical risk, L^(·).

Note that, when N is a sufficiently large polynomial-in-d function, the probability estimate for the first part is essentially 1exp(Θ(d)). Furthermore, we note that the exponent 1/4 in the probability estimate and the sample bound d58/3 are required only for part (c) and can potentially be improved. In particular, the exponent can be improved to one for parts (a), (b), and (d).

We now expand on the sample complexity d58/3 per part (c), which appears rather poor. In light of Soltanolkotabi et al. [83], a much smaller sample complexity of O(d2) suffices to obtain an approximate stationary point W. Our work, on the other hand, takes a step further: we control the deviation of WTW from (W*)TW* and, subsequently, bound the generalization error. As we mention earlier, we are unaware of similar guarantees in the literature; the prior work, in fact, seems to focus only on minimizing the training error. This extension is based on a certain technical result established in Emschwiller et al. [33] (see the following for details), and the sample bound is likely to be an artifact of this approach. Having said this, it might though be possible to improve the dependence on dimension d. As a potential avenue, one may consider leveraging techniques from the phase retrieval literature, such as PhaseLift (Candès and Li [13], Candès et al. [15], Demanet and Hand [22], Eldar and Mendelson [32]), or from the matrix sensing literature mentioned earlier. It appears that some of these algorithms solve a certain convex relation, and the fact that the underlying matrix to be recovered is low rank is crucial for this convex program. In our case, W* and, therefore, (W*)TW* are full rank, so it is not clear whether these algorithms would apply as is. In fact, as mentioned earlier, closely related works in the matrix sensing literature address a (variant of) an empirical risk minimization problem though they do not quantify how close the obtained solution and the ground truth are. The connection between our model and the phase retrieval/matrix sensing literature is very intriguing; it is plausible that the algorithms in the latter domain might apply to our case. This direction merits further investigation and is left for future work.

Note again that, analogous to Theorem 2, one needs a correction for the finite d case because the term Ndexp(Θ(d)) makes the probability estimate vacuous for d=O(1). Furthermore, the remarks following Theorem 2 still remain valid. In particular, the choice of 1O(1/N) is for simplicity, and the estimate can be improved almost immediately to 1O(Nα) for any α>0 and to 1exp(c¯0Nc0) for some c0,c¯0>0 using more delicate concentration tools.

We now provide an important remark pertaining to (c): as we show in the proof, provided N grows at least polynomially in d, with probability 1exp(CN1/4) over Xi, it holds that, for any W having a small risk, L^(W), WTW is close to (W*)TW*: WTW(W*)TW*F is small. Consequently, L(W) is small. This is one of the additional technical results of our paper and is achieved by controlling the condition number of a certain matrix whose i.i.d. rows consist of the tensorized data Xi2. The proof uses a recent work analyzing the spectrum of expected covariance matrices of tensorized data (Emschwiller et al. [33, theorem 5.1]). It is worth highlighting that the independence of the coordinates of Xi is crucial for establishing Theorem 6. Our results leverage techniques in Emschwiller et al. [33] and require, in particular, that the coordinates of Xi are samples from an admissible distribution in the sense of Emschwiller et al. [33, definition 2.4, theorem 5.1]; they would no longer be valid if Xi had potentially dependent coordinates.

These results concern the performance of gradient descent assuming the initialization is proper, that is, it is below the aforementioned energy barrier. One can then naturally ask whether such an initialization is indeed possible in some generic context. In the next section, we address this question of proper initialization when the (planted) weights are generated randomly in order to complement Theorems 5 and 6. We establish that such a proper initialization is indeed possible by providing a deterministic initialization guarantee, which, with high probability, beats the aforementioned energy barrier.

Remark 1.

Before we close this section, we remark on the full-rank assumption, rank(W*)=d. This assumption can be relaxed for Theorems 1 and 2, and one can establish a similar energy barrier for all W with rank(W)<rank(W*). For instance, suppose rank(W*)=d1 and rank(W)d2. Using an eigenvalue interlacing argument (see, e.g., Fulton [35, equation 11]), one can subsequently modify the proofs of Theorems 1 and 2 and obtain a similar energy barrier for all W whose rank is strictly below that of planted W*. On the other hand, the full-rank assumption appears necessary for the results regarding stationary points and global optimality (Theorems 3 and 4) and the subsequent convergence guarantees regarding the gradient descent (Theorems 5 and 6). In fact, a similar full-rank assumption to rule out spurious local minima is also employed in the very related work Soltanolkotabi et al. [83, theorem 2.2] discussed earlier. One of their results (Soltanolkotabi et al. [83, theorem 2.1]) seems to avoid this though at the expense of a rather unnatural assumption that the output layer contains at least d positive and d negative entries. Inspecting their proof as well as ours, a full-rank assumption or the assumption that the output layer has d positive and d negative entries seem crucial: to study the gradient of a stationary point and subsequently declare global optimality, it appears necessary to show that the kernel of a certain matrix arising in the gradient calculation is trivial. It would be interesting to see to which extent this assumption can be relaxed or whether it can be avoided altogether; we leave this as an interesting open direction for future work.

2.2. On Initialization: Randomly Generated Planted Weights

Our results in the previous section show that, provided the initialization of the gradient descent method occurs below the critical energy, the algorithm converges to the global minimum. This raises the question whether such an initialization can be found in a constructive way.

In this section, we show that the answer is yes in the setting of randomly generated weights of the ground truth matrix W*. Specifically, we provide a way to properly initialize such networks under the assumption that the (planted) weight matrix W*Rm×d has arbitrary i.i.d. centered entries with unit variance and a finite fourth moment and the data has centered i.i.d. sub-Gaussian coordinates. (It is worth mentioning that, similar to before, the sub-Gaussianity assumption on the data are required only for the case of empirical risk, and the corresponding population risk result holds under a milder distributional assumption; see Theorem 7.) Our result is valid provided that the network is sufficiently overparameterized: m>Cd2 for some large constant C. Note that this implies W* is a tall matrix sending Rd into Rm. The rationale behind this approach is as follows. The value of the risk is determined by the spectrum of ΔWTW(W*)TW* and the moments of the data distribution. Under our randomness assumption, the so-called Wishart matrix (W*)TW* is tightly concentrated around a multiple of the identity if m is sufficiently large. Hence, one can control the spectrum of Δ and, therefore, the loss functions (L and L^(·)) by properly choosing the initialization W.

We now state the main results of this section, starting with the population risk version.

Theorem 7.

Suppose that the data XRd consists of i.i.d. centered coordinates with Var(Xi2)>0 and E[Xi4]<. Recall L(W) from (3).

  1. (Gaussian case) Suppose that the planted weight matrix W*Rm×d has i.i.d. standard normal entries. Let the initial weight matrix W0Rm×d be defined by (W0)i,i=m+4d for 1id and (W0)i,j=0 otherwise; that is, W0TW0=γId with γ=m+4d. Then, provided m>Cd2 for a sufficiently large absolute constant C > 0,

    L(W0)<minWRm×d:rank(W)<d L(W),
    with probability at least 1exp(Ω(d)), where the probability is with respect to the draw of W*.

  2. (General case) Suppose the planted weight matrix W*Rm×d has centered i.i.d. entries with unit variance and a finite fourth moment. Let the initial weight matrix W0Rm×d be defined by (W0)i,i=m for 1id, and (W0)i,j=0 otherwise; that is, W0TW0=mId. Then, provided m>Cd2 for a sufficiently large absolute constant C > 0,

    L(W0)<minWRm×d:rank(W)<d L(W),
    with high probability as d, where the probability is with respect to the draw of W*.

The proof of this theorem is provided in Section 4.8.

Note that, part (a) of Theorem 7 gives an explicit rate for probability in the case when the i.i.d. entries of the planted weight matrix W* are standard normal and is based on a nonasymptotic concentration result for the spectrum of such matrices. The extension in part (b) is based on a semicircle law obtained by Bai and Yin [3].

The corresponding result for the empirical risk is provided as follows.

Theorem 8.

Suppose that the planted weight matrix W*Rm×d has centered i.i.d. entries with unit variance and a finite fourth moment; the (i.i.d.) data XiRd,1iN, has i.i.d. centered sub-Gaussian coordinates (namely, for some C > 0, P(|Xi(j)|>t)exp(Ct2) for any t > 0, 1iN and 1jd) and the W0Rm×d satisfies (W0)ii=m for i[d] and (W0)ij=0 for ij, that is, W0TW0=mIdRd×d. Then, for some absolute constants C,C>0 with probability at least

1exp(CNd5m)Ndexp(Cd)od(1),
it is the case that, for the constant C5 defined in Theorem 2,
L^(W0)<12C5σmin(W*)4
provided m>Cd2 for a sufficiently large constant C>0.

The proof of Theorem 8 is provided in Section 4.12.

It is worth noting that, unlike the earlier Theorems 2 and 6, Theorems 7 and 8 do not have a separate statement for the case of finite d (d=O(1)): in order to ensure the concentration property for the Wishart ensemble takes place, one should consider the regime d. That is, our initialization results do not hold for the regime in which d is constant.

We highlight that, for low-rank models, such as the problem of sensing a low-rank matrix described earlier as well as learning quadratic networks with a low-rank weight matrix W*, prior work proposes spectral initialization (Gunasekar et al. [41], Khodak et al. [52], Li et al. [57], Stöger and Soltanolkotabi [87], Vodrahalli et al. [92]) and analyzes the performance of gradient descent in recovering the planted weights. More concretely, for recovering an unknown rank-r matrix A*Rd×d, these works propose initializations of the form UUT, where URd×r. For our results, in particular, the global optimality of certain stationary points (Theorems 3 and 4) as well as the convergence guarantees for the gradient descent (Theorems 5 and 6), the fact that W* is full rank is crucial. Even though such a low-rank initialization would not suffice for the goal of finding a W0 with loss below the energy barrier, it would be interesting to analyze the performance of such an initialization for a higher number of iterations. If successful, such an initialization would enable us to relax the randomness assumption on W*; we leave this as an interesting future direction.

We now turn our attention to the number of training samples required to learn such models.

2.3. Critical Number of Training Samples

In this section, we explore the smallest number of training data for which a small empirical risk controls the generalization error.

2.3.1. A Necessary and Sufficient Geometrical Condition.

We start by identifying a necessary and sufficient geometrical condition on the data under which any minimizer of the empirical risk has zero generalization error. For our setting, any such minimizer necessarily interpolates the data.

Theorem 9.

Let X1,,XnRd be a set of data. Let (P) be the property that span{XiXiT:1iN}=S, where S is the set of all d × d symmetric real matrices.

  1. Suppose that (P) holds. Then, for any m^N and WRm^×d satisfying f(W*;Xi)=f(W;Xi),i[N], we have WTW=(W*)TW*. In particular, if m^m, then W=QW* for some QRm^×m with QTQ=Im.

  2. Suppose that (P) does not hold. Then, for any W*Rm×d with rank(W*)=d and any m^d, there exists a WRm^×d such that, whereas f(W*;Xi)=f(W;Xi),i[N],WTW(W*)TW*. In particular, almost surely L(W)>0 with respect to any jointly continuous distribution on Rd.

The proof of Theorem 9 is provided in Section 4.13.

Note that the property (P), span{XiXiT:i[N]}=S, is a purely geometrical necessary and sufficient condition; it can be checked ahead of the optimization (not retrospective). Under (P), any minimizer of empirical risk L^ has perfect generalization. Conversely, when (P) does not hold, there are global optima of L^ that (almost surely) have a strictly positive generalization error. Furthermore, Theorem 9 applies to any arbitrary data set Xi (not necessarily random). We highlight the extension that, when L^(W) is positive though small, one can control WTW(W*)TW*F via Theorem 5(c) and bound L(W). For a more refined version of Theorem 9, see Theorem 11.

We next highlight the parameter m^N. Note that, when (P) holds, any network with m^ nodes interpolating the data necessarily has zero generalization error. This holds even in the overparameterized regime, m^m. Namely, once the data are interpolated, overparameterization does not hurt the generalization ability. This is an instance of an empirically observed (and extensively studied) phenomenon regarding neural networks that defies classical statistical wisdom.

Finally, Theorem 9 still holds if each node has an associated positive output weight aj*R+, that is if the network computes jmaj*Wj*,X2. This is easily seen by pushing aj* inside Wj*.

2.3.2. Randomized Data Enjoys the Geometric Condition.

We now identify the smallest number N* of training samples such that random data Xi,i[N] almost surely satisfies (P) for NN*.

Theorem 10.

Let N*=d(d+1)/2 and X1,,XNRd be i.i.d. random vectors with a jointly continuous distribution. Then,

  1. If NN*, then P(span(XiXiT:i[N])=S)=1.

  2. If N<N*, then for arbitrary Z1,,ZNRd,span(ZiZiT:i[N])S.

Theorem 10 is likely folklore; we provide a proof in Section 4.14 for completeness. Observe that part (b) is straightforward: as N*=(d2)+d=dim(S), (P) cannot hold when N<N*.

2.3.3. Sample Complexity Bound for the Planted Network Model.

Combining Theorems 9 and 10, we arrive at the following sample complexity result.

Theorem 11.

Let Xi,i[N] be i.i.d. samples drawn from a jointly continuous distribution on Rd and Yi=f(W*;Xi), where W*Rm×d with rank(W*)=d.

  1. Suppose NN*, and m^N. Then, with probability one over X1,,XN, the following holds. If WRm^×d with f(W;Xi)=Yi,i[N], then f(W;x)=f(W*;x) for all xRd.

  2. Suppose XiRd,i[N] are i.i.d., each with centered i.i.d. entries having variance μ2 and a finite fourth moment μ4. Suppose N<N*. Then, there exists a WRm×d such that f(W;Xi)=Yi,i[N], but L(W)min{μ4μ22,2μ22}·σmin(W*)4.

The proof of Theorem 11 is provided in Section 4.15.

The lower bound in part (b) is very similar to the energy barrier for rank-deficient matrices per Theorems 1(a) and 2. Moreover, the interpolating network in (a) can be overparameterized. Theorem 11 provides the necessary and sufficient number of data points for training a shallow quadratic network so as to guarantee perfect generalization property.

2.3.4. Related Work.

We now discuss a related line of work on polynomial activations that seek necessary and sufficient conditions and/or the smallest sample size under which the loss functions satisfy various favorable properties. Venturi et al. [90] study a key topological property of the loss function, namely, the presence/absence of spurious valleys. Similar to us, they identify necessary and sufficient conditions independent of the data distribution. One of their results (Venturi et al. [90, corollary 9]) shows that the empirical risk L^ admits no spurious valleys provided the network is sufficiently overparameterized: mN, where N is the sample size. (A close variant of this result is in fact shown earlier by Livni et al. [59]; see Venturi et al. [90] for a comparison.) Moreover, they also establish a similar guarantee regarding the population risk for quadratic networks; see Venturi et al. [90, theorem 12]. Similar analysis for quadratic networks is conducted also by Soltanolkotabi et al. [83] and Du et al. [24] mentioned earlier. In particular, Soltanolkotabi et al. [83] establish the absence of spurious local minima for L^ when dNO(d2); Du et al. [24] establish that a regularized version of the empirical risk has no spurious minima when (a) m > d or (b) m(m+1)2n and m < d. Yet another related work by Mannelli et al. [60] show the following when the data has i.i.d. standard normal entries. For n>2d, the only minimizer of the empirical risk is a planted weight matrix, whereas for n<2d, the empirical risk admits spurious local minima, both w.h.p. as n,d.

3. Preliminaries

We collect herein several useful auxiliary results that we employ in our proofs.

3.1. An Analytical Expression for the Population Risk

Toward proving our energy barrier results, Theorems 2 and 1, we start with providing an analytical expression for the population risk L(W) of any WRm×d in terms of how close it is to the planted weight matrix W*Rm×d.

We recall that a random vector X in Rd is defined to have a jointly continuous distribution if there exists a measurable function f:RdR such that, for any i[N] and Borel set BRd,

P(XB)=Bf(x1,,xd)dλ(x1,,xd),
where λ is the Lebesgue measure on Rd.

Theorem 12.

Let W*Rm×d,f(W*;X) be the function computed by (1) and f(W;X) be similarly the function computed by (1) for WRm×d. Recall,

L(W)=E[(f(W*;X)f(W;X))2],
where the expectation is with respect to the distribution of XRd.
  1. Suppose the distribution of X is jointly continuous. Then, L(W)=0, that is, f(W*;X)=f(W;X) almost surely with respect to X if and only if W=QW* for some orthonormal matrix QRm×m. Suppose now that the coordinates of XRd are i.i.d. with E[Xi]=0,E[Xi2]=μ2, and E[Xi4]=μ4.

  2. It holds that

    L(W)=μ22·trace(A)2+2μ22·trace(A2)+(μ43μ22)·trace(AA),
    where A=(W*)TW*WTWRd×d and AA is the Hadamard product of A with itself. In particular, if XRd has i.i.d. standard normal coordinates, we obtain L(W)=trace(A)2+2trace(A2).

  3. The following bounds hold:

    μ22·trace(A)2+min{μ4μ22,2μ22}·trace(A2)L(W),
    and
    μ22·trace(A)2+max{μ4μ22,2μ22}·trace(A2)L(W).

The proof of Theorem 12 is provideed in Section 4.1.

In a nutshell, Theorem 12 states that the population risk L(W) of any WRd is completely determined by how close it is to the planted weights W* as measured by the matrix A=(W*)TW*WTW and the second and fourth moments of the data. This is not surprising: L(W) is essentially a function of the first four moments of the data and the difference of the quadratic forms generated by W and W*, which is precisely encapsulated by the matrix A. Note also that the characterization of the optimal orbit per part (a) is not surprising either: any matrix W with the property W=QW*, where QRm×m is an orthonormal matrix, that is, QTQ=Im, has the property that f(W;X)=WX22=XTWTWX=f(W*;X) for any data XRd. Part (a) then says the reverse is true as well provided that the distribution of X is jointly continuous. Note also that, for X with centered i.i.d. entries, the thesis of part (a) follows also from part (c): L(W)=0 implies that trace(A2)=0, which, together with the fact that A is symmetric, then yields A = 0; that is, WTW=(W*)TW*.

3.2. Useful Lemmas and Results from Linear Algebra and Random Matrix Theory

Our next result is a simple norm bound for the ensemble XiRd,1iN with sub-Gaussian coordinates.

Lemma 1.

Let XiRd,1iN be an i.i.d. collection of random vectors with centered i.i.d. sub-Gaussian coordinates, that is, for some constant C > 0, P(|Xi(j)|>t)exp(Ct2) for every i[N],j[d], and t0. Then,

P(XidK1,1iN)1Ndexp(Cd2K1).

The proof of Lemma 1 is provided in Section 4.3.

Our energy barrier result Theorem 2 for the empirical risk is proven by establishing the emergence of a barrier for a single rank-deficient ARd×d together with a covering numbers argument.

Lemma 2.

Let K1>0 be an arbitrary constant and XiRd,1iN be a collection of i.i.d. data with centered i.i.d. sub-Gaussian coordinates for which for any M > 0, the mean of |X1(1)| conditional on |X1(1)|M is zero, and let Yi=f(W*;Xi) be the corresponding label generated by a neural network with planted weights W*Rm×d as per (1), where W*FdK2. Fix any ARd×d, where AFd2K2,rank(A)d1, and A0. Define the event

E(A){1N1iN(YiXiTAXi)212C5(K1)σmin(W*)4},
where
C5(K1)min{μ4(K1)μ2(K1)2,2μ2(K1)2}
for μn(K)=E[X1(1)n||X1(1)|dK]. Then, there exists a constant C>0 (independent of W and depending only on data distribution, K1, and W*) such that
P(E(A)c|XidK1,1iN)exp(C3Nd4K1+4K2+2).

In particular,

P(E(A))1exp(C3Nd4K1+4K2+2)NdeCd2K1,
where C > 0 is the same constant as in Lemma 1.

The parameter K1 appearing in Lemma 2 controls the amount of truncation applied on training data, and K2 controls the norm of the planted weight matrix. The proof of Lemma 2 is provided in Section 4.4.

The next result is a covering number bound adopted from Candès and Plan [14, lemma 3.1] with minor modifications.

Lemma 3.

Let

SR{ARd×d:rank(A)r,A0,AFR}.

Then, there exists a ϵnet SR¯ for SR in the Frobenius norm (that is, for every ASR, there exists an A^SR¯ such that AA^Fϵ) such that

|SR¯|(9Rϵ)dr+r.

The proof of Lemma 3 is provided in Section 4.5.

Some of our results use the following well-known results. These results are verbatim from the literature and provided herein without proof.

Theorem 13

(Caron and Traynor [16]). Let  be an arbitrary positive integer and P:RR be a polynomial. Then, either P is identically zero or {xR:P(x)=0} has zero Lebesgue measure, namely, P(x) is nonzero almost everywhere.

Theorem 14

(Horn and Johnson [47, Theorem 7.3.11]). For two matrices ARp×n and BRq×n, where qp, ATA=BTB holds if and only if A = QB for some matrix QRp×q with orthonormal columns.

Our results regarding the initialization guarantees use the several auxiliary results from random matrix theory: the spectrum of tall random matrices are essentially concentrated.

Theorem 15

(Vershynin [91, Corollary 5.35]). Let A be an m × d matrix with independent standard normal entries. For every t0, with probability at least 12 exp(t2/2), we have

mdtσmin(A)σmax(A)m+d+t.

Theorem 16

(Bai and Yin [4], Vershynin [91, Theorem 5.31]). Let A=AN,n be an N × n random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and a finite fourth moment. Suppose that the dimensions N and n grow to infinity and the aspect ratio n/N converges to a constant in [0,1]. Then,

σmin(A)=Nn+o(n),andσmin(A)=N+n+o(n),
almost surely.

The following concentration result, recorded herein verbatim from Vershynin [91], is useful for our approximate stationarity analysis.

Theorem 17

(Vershynin [91, Theorem 5.44]). Let A be an N × n matrix whose rows Ai are independent random vectors in Rn with the common second moment matrix Σ=E[AiAiT]. Let m be a number such that Ai2m almost surely for all i. Then, for every t0, the following inequality holds with probability at least 1n·exp(ct2):

1NATAΣmax(Σ1/2δ,δ2)whereδ=tm/N.

Here, c > 0 is an absolute constant.

Finally, we make use of the matrix-operator version of Hölder’s inequality.

Theorem 18

(Bhatia [8]). For any matrix URk×, let Uσp denotes the p norm of the vector

(σ1(U),,σmin{k,}(U))
of singular values of U. Then, for any p,q>0 with 1p+1q=1, it holds that
|U,V|=|trace(UTV)|UσpVσq.

4. Proofs

In this section, we present the proofs of the main results of this paper.

The order of the proofs presented herein is slightly different from the order of the corresponding results in the main body in that none of the following proofs (with one exception that we detail) use a proof presented later than itself. That is, whenever we present the proof of a result, it is ensured that, if this proof requires another result as a building block, this building block is shown earlier. The rationale behind this is to avoid any potential confusion and to ensure that no cyclic reasoning is present.

With this arrangement, only Theorem 4 uses results presented later in this section (more precisely, it uses Theorems 9 and 10), and it can be checked directly that there is no cyclic reasoning in the proof of Theorem 4.

4.1. Proof of Theorem 12

First, we have

f(W;X)f(W*;X)=XT((W*)TW*WTW)XXTAX,(5)
where A=(W*)TW*WTWRd×d is a symmetric matrix. Note also that
trace(A)2=i=1dAii2+2i<jAiiAjj,(6)
and
trace(A2)=trace(ATA)=AF2=i,jAij2=i=1dAii2+2i<jAij2,(7)
where A2 is equal to ATA as A is symmetric.
  1. Recall Theorem 13. In particular, if L(W)=0, then we have P(X)=XTAX=0 almost surely. Because P(·):RdR a polynomial, it then follows that P(X) = 0 identically. Now, because A is symmetric, it has real eigenvalues, called λ1,,λd with corresponding (real) eigenvectors ξ1,,ξd. Now, taking X=ξi, we have XTAX=ξiTAξ=λiξi,ξi=0. Because ξi0, we get λi=0 for any i. Finally, because A=QΛQT, it must necessarily be the case that A = 0. Hence, WTW=(W*)TW*, which implies W=QW* for some QRm×m orthonormal, per Theorem 14.

  2. Using Equation (5), we first have

    L(W)=1i,j,i,jdAijAi,jE[XiXjXiXj].

    Note that, if |{i,j,i,j}|{3,4}, then E[XiXjXiXj]=0 because X has centered i.i.d. coordinates. Keeping this in mind and carrying out the algebra, we then get

    L(W)=i=1dAii2E[Xi4]+2i<jAiiAjjE[Xi2]E[Xj2]+4i<jAij2E[Xi2]E[Xj2]=μ4i=1dAii2+2μ22i<jAiiAjj+4μ22i<jAij2.

    Using now Equations (6) and (7), we get

    L(W)=(μ43μ22)·trace(AA)+μ22·trace(A)2+2μ22·trace(A2),
    because Aii2=(AA)ii.

  3. Define k to be such that μ4μ22=2kμ22, namely, k is related to measures of dispersion pertaining to Xi: 2k is the coefficient of variation, and (2k+1) is the kurtosis associated to the random variable Xi. With this, we have

    L(W)=μ22·trace(A)2+2μ22(ki=1dAii2+2i<jAij2).

From here, the desired conclusion follows because

μ22·trace(A)2+2 min{k,1}μ22(i=1dAii2+2i<jAij2)L(W),
and
μ22·trace(A)2+2 max{k,1}μ22(i=1dAii2+2i<jAij2)L(W),
together with Equation (7).

4.2. Proof of Theorem 1

  1. Note first that using Theorem 12 part (c), we have

    L(W)min{Var(Xi2),2E[Xi2]2}trace(A2).

    Now, fix any WRm×d with rank(W)<d. Let a1ad be the eigenvalues of (W*)TW*, b1bd be the eigenvalues of WTW, and λ1λd be the eigenvalues of (W*)TW*WTW. Because W is rank deficient, we have b1=0. Furthermore, ad=σmin(W*)2 because the eigenvalues of (W*)TW* are precisely the squares of the singular values of W*. Now, recall the (Courant–Fischer) variational characterization of the eigenvalues (Horn and Johnson [47]). If M is a d × d matrix with eigenvalues c1cd, then

    c1=maxx:x2=1 xTMxandcd=minx:x2=1 xTMx.

    With this, fix an xRd with x2=1. Then,

    xT((W*)TW*WTW)xminx:x2=1 xT(W*)TW*x+xT(WTW)x=ad+xT(WTW)x.

    Because this inequality holds for every x with x2=1, we can take the max over all x and arrive at

    λ1=maxx:x2=1 xT((W*)TW*WTW)xad+b1=adσmin(W*)2.

    Now, because λ12,,λd2 are precisely the eigenvalues of A2, we have trace(A2)=i=1dλi2λ12. Hence, for any W with rank(W)<d, it holds that

    L(W)min{Var(Xi2),2E[Xi2]2}λ12.

    Finally, because λ12σmin(W*)4, the desired conclusion follows by taking the minimum over all rank-deficient W.

  2. Let the eigenvalues of (W*)TW* be denoted by λ1*,,λd* with the corresponding orthogonal eigenvectors q1*,,qd*. Namely, diagonalize (W*)TW* as Q*Λ*(Q*)T, where the columns of Q*Rd×d are q1*,,qd* and Λ*Rd×d is a diagonal matrix with (Λ*)i,i=λi* for every 1id. Let

    W¯=j=1d1λj*qj*(qj*)TRd×d.

Observe that W¯TW¯=Q*Λ¯Q*, where Λ¯Rd×d is a diagonal matrix with (Λ¯)i,i=(Λ*)i,i for every 1id1 and (Λ¯)d,d=0 and that rank(W¯)=d1. Now, let W1¯,,Wd¯Rd be the rows of W¯ and fix a j[d] such that Wj¯0.

Having constructed a W¯Rd×d, we now prescribe WRm×d as follows. For 1id,ij, let Wi=Wi¯, where Wi is the ith row of W. Then, set Wj=12Wj¯, and for every d+1im, set Wi=32mdWj¯. For this matrix, we now claim

WTW=W¯TW¯.

To see this, fix an XRd and recall that XTWTWXXTW¯TW¯X=WX22W¯X22. We now compute this quantity more explicitly:

WX22W¯X22=k=1dWk,X2k=1mW¯k,X2=k=1,kjdWk,X2+Wj,X2k=1,kjdWk,X212Wj,X2k=d+1m32mdWj,X2=Wj,X214Wj,X234(md)(md)Wj,X2=0.

Hence, for every XRd, we have

XTWTWX=XTW¯TW¯X.

Now, let Ξ=WTWW¯TW¯. Note that ΞRd×d is symmetric and XTΞX=0 for every XRd. Now, taking X to be ei, that is, the ith element of the standard basis for the Euclidean space Rd, we deduce Ξi,i=0 for every i[d]. For the off-diagonal entries, let X=ei+ej. Then, XTΞX=Ξi,i+Ξi,j+Ξj,i+Ξj,j=0, which, together with the fact that the diagonal entries of Ξ are zero, imply Ξi,j=Ξj,i, namely, Ξ is skew-symmetric. Finally, because Ξ is also symmetric, we have Ξi,j=Ξj,i, which then implies for every i,j[d],Ξi,j=0, that is, Ξ = 0, and thus, WTW=W¯TW¯.

Hence, we have for WRm×d with rank(W)=d1,

WTW(W*)TW*=Q*Λ(Q*)T,
with (Λ)i,i=0 for every 1id1 and (Λ)d,d=λd*. Namely, the spectrum of the matrix A=(W*)TW*WTW contains only two values: zero with multiplicity d − 1 and λd* with multiplicity one. In particular,
trace(A)=λd*andtrace(A2)=(λd*)2.

Using now the upper bound provided by Theorem 12 part (c) yields the desired claim. Therefore, the energy band lower bound is tight up to a multiplicative constant.

4.3. Proof of Lemma 1

For any fixed i[N],j[d], note that using sub-Gaussian property one has P(|Xi(j)|>dK1)exp(Cd2K1); thus, P(i[N],j[d]:|Xi(j)|>dK1)Ndexp(Cd2K1), using union bound, which yields the conclusion.

4.4. Proof of Lemma 2

Let

E1{XidK1,1iN}.

By Lemma 1, P(E1)1Ndexp(Cd2K1). Now, note that

P(E(A)c)=P(E(A)c|E1)P(E1)+P(E(A)c|E1c)P(E1c)P(E(A)c|E1)+N exp(Cd2K1).(8)

We now study P(E(A)c|E1); hence, assume we condition on E1 from now on. The triangle inequality yields

|YiXiTAXi||XiTAXi|+|XiT(W*)TW*Xi|.

Observe now that

XiXiF2=trace(XiXiTXiXiT)=Xi22trace(XiXiT)=Xi24,
which implies (conditional on E1)
XiXiTF=Xi22d2K1+1.

Now, the Cauchy–Schwarz inequality with respect to the inner product U,Vtrace(UTV) yields

|XiTAXi|=A,XiXiTAFXiXiTFd2K1+2K2+1,
for every i[N], using AFd2K2.

Next, let A*=(W*)TW*Rd×d, and let η1*,,ηd* be the eigenvalues of A*, all nonnegative. Observe that

W*F2=trace(A*)=1jdηj*d2K2.

Now, note that (η1*)2,(η2*)2,,(ηd*)2 are the eigenvalues of (A*)2=(A*)TA*. With this reasoning, we have

A*F2=trace((A*)TA*)=trace((A*)2)=1jd(ηj*)2(1jdηj*)2d4K2.

Consequently, A*Fd2K2, and therefore, the exact same reasoning yields

|XiT(W*)TW*Xi|=XiTA*Xid2K1+2K2+1,
for every i[N]. Hence, conditional on E1, it holds that, for every i[N],
(XiTAXiXiT(W*)TW*Xi)24d4K1+4K2+2.

We now apply concentration to i.i.d. sum

1N1iN(XiTAXiXiT(W*)TW*Xi)2,
which is a sum of bounded random variables that are at most 4d4K1+4K2+2.

Now, recalling the distributional assumption on the data, we have that, conditional on XidK1, the data still has i.i.d. centered coordinates. In particular, the energy barrier result for the population risk as per Theorem 1 applies:

E[(XTAXXT(W*)TW*X)2|E1]C5(K1)σmin(W*)4,
where
C5(K1)=min{μ4(K1)μ2(K1)2,2μ2(K1)2},
is controlled by the conditional moments of data coordinates.

Finally applying Hoeffding’s inequality for bounded random variables, we arrive at

1N1iN(XiTAXiXiT(W*)TW*Xi)212C5σmin(W*)4,
with probability at least 1exp(C3Nd4K14K22). Namely,
P(E(A)c|E1)exp(C3Nd4K14K22).

Returning to (8), this yields

P(EA)1exp(C3Nd4K14K22)Ndexp(Cd2K1).

This completes the proof of Lemma 2.

4.5. Proof of Lemma 3

The proof is almost verbatim from Candès and Plan [14, lemma 3.1], and included herein for completeness.

Note that any ARd×d,A0, and rank(A)=r decomposes as A=QΛQT, where QRd×r, satisfying QTQ=Id and ΛRr×r, a diagonal matrix with nonnegative diagonal entries. Notice, furthermore, that AF=ΛFR as Q is orthonormal. With this, we now construct an appropriate net covering the set of all permissible Q and Σ.

Let D be the set of all r × r diagonal matrices with nonnegative diagonal entries with a Frobenius norm at most R. Let D¯ be an ϵ3net for D in the Frobenius norm. Using standard results (see, e.g., Vershynin [91, lemma 5.2]), we have

|D¯|(9Rϵ)r.

Now, let Od,r={QRd×r:QTQ=Id}. To cover Od,r, we use a more convenient norm ·1,2 defined as

X1,2=maxiXi2,
where Xi is the ith column of X. Define Qd,r={XRd×r:X1,21}. Note that Od,rQd,r. Furthermore, observe also that Qd,r has a ϵnet of cardinality at most (3/ϵ)dr. With this, we now take O¯d,r to be a ϵ3Rnet for Od,r. Consider now the set
SR¯{Q¯Λ¯Q¯T:Q¯O¯d,r,Λ¯D¯}.

Clearly,

|SR¯||O¯d,r||D¯|(9R/ϵ)dr+r.

We now claim SR¯ is indeed a ϵnet for SR in the Frobenius norm. To prove this, take an arbitrary ASR, and let A=QΛQT. There exists a Q¯O¯d,r and a Σ¯D¯ such that ΣΣ¯Fϵ/3, and QQ¯1,2ϵ/3R. Now, let A¯=Q¯Σ¯Q¯T. Observe that, using the triangle inequality,

A¯AF=QΛQTQ¯Λ¯Q¯TFQΛQTQ¯ΛQTF+Q¯ΛQTQ¯Λ¯QTF+Q¯Λ¯QTQ¯Λ¯Q¯TF.

For the first term, note that, because Q is orthonormal, (QQ¯)ΛQTF=(QQ¯)ΛF. Next,

(QQ¯)ΛF2=1idΛii2QiQ¯i22QQ¯1,22ΣF2(ϵ/3)2,
using QQ¯1,2ϵ/3R and ΣFR. Thus, QΛQTQ¯ΛQTFϵ/3. Similarly, we also have Q¯Λ¯QTQ¯Λ¯Q¯TFϵ/3. Finally, Q¯ΛQTQ¯Λ¯QTF=ΛQTΛ¯QTF=ΛΛ¯Fϵ/3 using again the facts that Q and Q¯ are both orthonormal. This concludes that A¯AFϵ; thus, |SR¯| is indeed a ϵnet for SR, in the Frobenius norm, of cardinality at most (9R/ϵ)dr+r.

As a side remark, observe that we gain an extra factor of two in the exponent owing to the fact that A is positive semidefinite (otherwise, the bound would be (9R/ϵ)2dr+r).

4.6. Proof of Theorem 3

We first establish the following proposition for any W, which is a stationary point of the population risk.

Proposition 1.

Let D*Rd×d be a diagonal matrix with Dii*=((W*)TW*)ii and define DRd×d analogously. Then, WRm×d enjoys the stationarity equation

(μ43μ22)WD*+μ22WW*F2+2μ22(W(W*)TW*)=(μ43μ22)WD+μ22WWF2+2μ22(W(WTW)).

To that end, fix a k0[m] and 0[d]. Note that k0,0L(W)=E[k0,0(f(W*;X)f(W;X))2], using the dominated convergence theorem. Next, E[k0,0(f(W*;X)f(W;X))2]=0 implies that, for every k0[m] and 0[d],

j=1mE[Wj*,X2Wk0,XX0]=j=1mE[Wj,X2Wk0,XX0].

Note next that j=1mE[Wj*,X2Wk0,XX0] computes as

μ4j=1m(Wj,0*)2Wk0,0+μ22j=1m1d,0Wk0,0(Wj,*)2+2μ22j=1m1d,0Wk0,Wj,*Wj,0*.

We now put this object into a more convenient form. Notice that the preceding expression is

(μ43μ22)Ak0,0+μ22Bk0,0+2μ22Ck0,0,
where
Ak0,0=Wk0,0j=1m(Wj,0*)2andBk0,0=j=1m=1dWk0,0(Wj,*)2andCk0,0=j=1m=1dWk0,Wj,0*Wj,*.

Observe that Bk0,0=Wk0,0W*F2. We now study Ak0,0 and Ck0,0 more carefully. Observe that j=1m(Wj,0*)2=((W*)TW*)0,0. Now, let D*Rd×d be a diagonal matrix in which (D*)ij=((W*)TW*)ii if i = j and zero otherwise. We then have Ak0,0=(WD*)k0,0.

We now study Ck0,0. Recall that Wi* is the ith row W*. Observe that j=1mWj,0*Wj,*=((W*)TW*)0,. Hence,

j=1m=1dWk0,Wj,0*Wj,*==1dj=1mWk0,Wj,0*Wj,*==1dWk0,((W*)TW*)0,=(W((W*)TW*))k0,0,
that is, Ck0,0=(W((W*)TW*))k0,0. Combining everything, we have that, for every k0[m] and 0[d],
j=1mE[Wj*,X2Wk0,XX0]=(μ43μ22)(WD*)k0,0+μ22Wk0,0W*F2+2μ22(W((W*)TW*))k0,0.

In particular, stationarity yields

(μ43μ22)WD*+μ22WW*F2+2μ22(W((W*)TW*))=(μ43μ22)WD+μ22WWF2+2μ22W(WTW),(9)
where the d × d diagonal matrix D is defined as Dii=(WTW)ii, and entry-wise equalities are converted into equality of two matrices by varying k0[m] and 0[d]. □

Having now established Proposition 1 for the stationarity equation, we now study its implications for any full-rank W.

Let WRm×d be a stationary point with rank(W)=d. We first establish WF=W*F. Because WRm×d is a stationary point, it holds that, for every (k0,0)[m]×[d],k0,0L(W)=0. In particular, Equation (9) holds.

Recalling now that W is full rank, it follows from the rank-nullity theorem that ker(W) is trivial, that is, ker(W)={0}. Hence, for matrices M1, M2 (with matching dimensions), whenever WM1 = WM2 holds, we deduce M1 = M2 because each column of M1M2 is contained in ker(W). Thus, Equation (9) then yields

(μ43μ22)D*+μ22W*F2Id+2μ22(W*)TW*=(μ43μ22)D+μ22WF2Id+2μ22WTW.(10)

Next, note that trace(D*)=i=1d((W*)TW*)ii=trace((W*)TW*)=W*F2, and similarly, trace(D)=WF2. In particular, taking traces of both sides in Equation (10), we get

(μ4μ22)W*F2+μ22dW*F2=(μ4μ22)WF2+μ22dWF2,
implying that W*F2=WF2. Incorporating this into Equation (10), we then arrive at
(μ43μ22)D*+2μ22(W*)TW*=(μ43μ22)D+2μ22WTW.

Now, suppose i[d]. Note that inspecting the preceding (i, i) coordinate, we get

(μ43μ22)((W*)TW*)ii+2μ22((W*)TW*)ii=(μ43μ22)(WTW)ii+2μ22(WTW)ii.

Because μ4μ22=Var(Xi2)>0, we then get

((W*)TW*)ii=(WTW)ii.

Now, focus on off-diagonal entries by fixing ij. Observe that, because Var(Xi2)>0, it also holds that E[Xi2]=μ2>0. Now, note that, Dij*=Dij=0 in this case. We then have

2μ2((W*)TW*)ij=2μ2(WTW)ij(W*)TW*=WTW.

We conclude that the matrix (W*)TW*WTW is a zero matrix. Hence, W=QW* for some orthonormal QRm×m per Theorem 14, and L(W)=0.

4.7. Proof of Theorem 5

Let {Wt}t0 be a sequence of m × d matrices corresponding to the weights along the trajectory of gradient descent, that is, WtRm×d is the weight matrix at iteration t of the algorithm. We first show L<. To see this, recall Theorem 12(c): L(W)μ22·trace(A)2, where trace(A)=WF2W*F2. In particular, this yields μ22(WF2W*F2)2L(W). Hence, for any W with L(W)L(W0), it holds that

WF(L(W0)μ2+W*F2)1/2<.

Namely, the (Frobenius) norm of the weights of any W with L(W)L(W0) remains uniformly bounded from above. This, in turn, yields that the (spectral norm of the) Hessian of the objective function remains uniformly bound from above for any such W because the objective is a polynomial function of W, which is precisely what we denote by L.

We now run gradient descent with a step size of η<1/2L: a second order Taylor expansion reveals that

L(W1)L(W0)ηL(W0)22/2,
where L(W) is the gradient of the population risk, evaluated at W.

In particular, L(W1)L(W0), and furthermore, 2L(W1)L, where 2L(W) is the spectral norm of the Hessian matrix 2L(W). From here, we induct on k: the induction argument reveals that we can retain a step size of η<1/2L, and furthermore, we deduce that the gradient descent trajectory {Wk}k0 is such that (i) L(Wk)L(Wk+1) for every k0, and furthermore, (ii) it holds for every k0 that

L(Wk+1)L(Wk)ηL(Wk)22/2.

We now establish that L(Wk)20 as k. Note that the objective function is lower bounded (by zero). If the gradient is nonvanishing, then (by passing to a subsequence if necessary) each step reduces the value of the objective function at least by a certain amount, that is (uniformly) bounded away from zero. But this contradicts the fact that the objective is lower bounded. Thus, we deduce

limkL(Wk)2=0.

Now, recall that the trajectory is such that L(Wk)L(Wk+1) and L(Wk)20 as k. Suppose that the initial value, L(W0), is such that

L(W0)<minWRm×d:rank(W)<d L(W).

In particular, for every kZ+,

L(Wk)L(W0)<minWRm×d:rank(W)<d L(W).(11)

Therefore, WkRm×d is full rank for all k, per Theorem 1. We now establish

limk L(Wk)=0.

To see this, observe that the sequence {L(Wk)}k0 is monotonic (nonincreasing) and, furthermore, is bounded by zero from below. Hence,

limk L(Wk)
exists (Rudin [76, theorem 3.14]. We now show =0.

Because the weights remain bounded along the trajectory, it follows that there exists a subsequence {Wkn}nN with a limit, that is, WknW as n, where WRm×d. Now, the continuity of L, together with the continuity of the norm ·2, imply that L(W)2=0. Furthermore, continuity of L(·) then implies L(W)=. Now, because Wkns are such that L(Wkn)L(W0) for all nN and L(W0) is strictly smaller than the rank-deficient energy barrier, by taking limits as k and using (11), we conclude that W is full rank. Because W is also a stationary point of the loss, by Theorem 3, we deduce L(W)=0, which yields =0 as desired.

4.8. Proof of Theorem 7

4.8.1. Part (a).

Let t=d. Then, using Theorem 15, it holds that, with probability 12 exp(d/2),

m2dσmin(W*)σmax(W*)m+2dm+4d4mdλmin((W*)TW*)λmax((W*)TW*)m+4d+4md.

Recall that σ(A) denotes the spectrum of A, that is, σ(A)={λ:λ is an eigenvalue of A}. We claim then the spectrum of γIA is γσ(A). To see this, simply note the following line of reasoning:

γλσ(γIA)det((γλ)I(γIA))=0det(λIA)=0λσ(A).

Now, let W0Rm×d be such that W0TW0=γI with γ=m+4d. In particular, if λ1λd are the eigenvalues of γI(W*)TW* with γ=m+4d; then, it holds that

4mdλ1λd4md.

Now, recall by Theorem 12(c) that

L(W0)μ22(i=1dλi)2+max{Var(Xi2),2E[Xi2]2}(i=1dλi2),
where σ(W0TW0(W*)TW*)={λ1,,λd}. For the second term, we immediately have i=1dλi216md2.

For the first term, note first that, if λ1λd are the eigenvalues of (W*)TW*, then

k=1dλk=trace((W*)TW*)=i=1mj=1d(Wij*)2k=1d(λkm)=i=1mj=1d((Wij*)21),
where Wij*=dN(0,1) i.i.d. Note also that, (Wij*)21 is a centered random variable and has a subexponential tail; see Vershynin [91, lemma 5.14]. Now, letting Zij=(Wij*)21 and applying the Bernstein-type inequality (Vershynin [91, proposition 5.16]), we have that, for some absolute constants K,c>0, it holds
P(|i=1mj=1dZij|>dm)2 exp(c min(dK2,dmK))2 exp(cd/K2)=exp(Ω(d)),
for m sufficiently large. In particular, with probability at least 1exp(Ω(d)), it, therefore, holds that
|k=1d(λkm)|dm.

Finally, using the triangle inequality,

|k=1dλk|=|k=1d(λk(m+4d))||k=1d(λkm)|+4d2dm+4d2,
with probability 1exp(Ω(d)). After squaring, we obtain that (i=1dλi)216d4+8d3m+d2m. In particular, we get
L(W0)μ22(i=1dλi)2+max{Var(Xi2),2E[Xi2]2}(i=1dλi2)μ22(16d4+8d3m+md2)+max{Var(Xi2),2E[Xi2]2}16md2.

Using now the overparameterization m>Cd2, we further have

E[Xi2]2(16d4+8d3m+md2)+max{Var(Xi2),2E[Xi2]2}16md2C(C)m2,
where
C(C)=E[Xi2]2(16C2+8C3/2+1C)+16Cmax{Var(Xi2),2E[Xi2]2}.

Note that, for the constant C(C),

C(C)0asC.

Next, observe that, m2d12m for m large (in the regime m>Cd2 with C large enough). Thus, using what we establish in Theorem 1, we arrive at

minWRm×d:rank(W)<d L(W)>min{Var(Xi2),2E[Xi2]2}σmin(W*)4min{Var(Xi2),2E[Xi2]2}(m2d)4116min{Var(Xi2),2E[Xi2]2}m2.

Finally, observe also that, if Var(Xi2)>0, then E[Xi2]>0 as well: indeed observe that, if E[Xi2]=0, then Xi = 0 almost surely, for which Var(Xi2)=0. In particular, min{Var(Xi2),2E[Xi2]2}>0. Equipped with this, we then observe that provided

116min{Var(Xi2),2E[Xi2]2}>C(C)=E[Xi2]2(16C2+8C3/2+1C)+16Cmax{Var(Xi2),2E[Xi2]2},
that is, provided C > 0 is sufficiently large, we are done.

4.8.2. Part (b).

Note that the result of Bai and Yin [3] asserts that, if {μ1,,μd} are the eigenvalues of

A12md((W*)TW*mId),
and if we define the empirical measure
FA(x)=1d|{i:μix}|,
then in the regime d+,d/m0, it holds that
FA(x)ω(x),
almost surely, where ω(x) is the semicircle law, and moreover,
1di=1dμi2x2dω(x)χ2,
namely, χ2 is, respectively, the second moment under semicircle law, w.h.p. Now, define the same quantities as in proof of part (a), where this time W0TW0=mId, and {λ1,,λd}=σ((W*)TW*mId). In particular, we still retain the inequality per Theorem 12(c):
L(W0)μ22(i=1dλi)2+max{Var(Xi2),2E[Xi2]2}(i=1dλi2).

Note that λi=2mdμi. Hence, we obtain

i=1dλi2<(4+o(1))md2χ2
w.h.p. We now control i=1dλi using the central limit theorem (CLT). Observe that
i=1dλi=trace((W*)TW*mId)=i=1mj=1d((Wij*)21).

Now, note that

σ*2Var((Wij*)21)=Var((Wij*)2)<E[(Wij*)4]<.

We now use CLT as d and m/d. To that end, let 1/2>ϵ>0 be fixed. Observe now that, for any arbitrary M > 0, and sufficiently large d,

{11σ*mddϵi=1mj=1d((Wij*)21)1}{M1σ*mdi=1mj=1d((Wij*)21)M}.

In particular, using the central limit theorem, we deduce

lim infd P(11σ*mddϵi=1mj=1d((Wij*)21)1)P(Z[M,M]),
where Z is a standard normal random variable. Now, because M > 0 is arbitrary, by sending M+, we obtain
lim infd P(11σ*mddϵi=1mj=1d((Wij*)21)1)1,
and we then conclude
limd P(11σ*mddϵi=1mj=1d((Wij*)21)1)=1.

Hence,

|i=1dλi|σ*mddϵ,
with probability 1od(1) for d sufficiently large.

Moreover,

σmin(W*)4116m2,
for m large, using yet another result of Bai and Yin, see Theorem 16. From here, carrying the exact same analysis as in part (a), we obtain that, provided m>Cd2 for some large constant C > 0 and d sufficiently large, the following holds with probability 1od(1):
L(W0)<minWRm×d:rank(W)<d L(W),
where W0 is prescribed such that W0TW0=mId.

4.9. Proof of Theorem 2

First, let

S1{WRm×d:rank(W)<d,L^(W)<12C5σmin(W*)4}.

We start with the following claim.

Claim 1.

In the setting of Theorem 2, the following holds. With probability at least 1exp(Cd) (where C>0 is some absolute constant), it holds that, for any WRm×d with L^(W)12C5σmin(W*)4,

WFdK+1,
provided NCd for some absolute constant C>0.

Proof of Claim 1.

For convenience, let L^012C5σmin(W*)4, and for the random data vector X=(X1,,Xd)Rd, let σ2=E[X12]. Recall that X has i.i.d. centered coordinates with a sub-Gaussian coordinate distribution.

We have the following, in which the implication is due to Cauchy–Schwarz:

L^01N1iN(Yif(Xi;W))2(L^0)1/2|1N1iN(Yif(Xi;W))|.

We now establish that, with probability at least 12 exp(t2d), the following holds provided NC(t/ϵ)2d: for every WRm×d,

|1N1iNXiTWTWXiσ2WF2|ϵσ2WF2.

To see this, we begin by noticing XiTWTWXi=trace(XiTWTWXi)=WTW,XiXiT. Using this, we have

|1N1iNXiTWTWXiσ2WF2|=|WTW,1N1iNXiXiTσ2Id|.

We now use Hölder’s inequality (Theorem 18) with p=1,q=,U=WTW and V=1NiXiXiTσ2Id. This yields

|WTW,1N1iNXiXiTσ2Id|WF21N1iNXiXiTσ2Id.

Observing now E[XiXiT]=σ2Id, we have

1N1iNXiXiTσ2Idϵσ2
with probability at least 12 exp(t2d) provided NC(t/ϵ)2d, using the concentration result on sample covariance matrix from Vershynin [91, corollary 5.50]. Hence, on this high probability event, the following holds:
1N1iNXiT(W*)TW*Xiσ2(1+ϵ)W*F2and1N1iNXiTWTWXiσ2(1ϵ)WF2.

Hence,

L^01N1iN(XiTWTWXiXiT(W*)TW*Xi)σ2(1ϵ)WF2σ2(1+ϵ)W*F2.

This yields, for any W with L^(W)L^0,

WF((L^0)1/2σ2(1ϵ)+1+ϵ1ϵW*F2)1/2
with probability at least 12 exp(t2d). Now, observe that
σmin(W*)2=λmin((W*)TW*)trace((W*)TW*)W*F2d2K.

Furthermore, C5=O(1). This yields

L^0=12C5σmin(W*)4=O(d4K).(12)

We now take ϵ=1/2 and conclude that

WF((L^0)1/2σ2(1ϵ)+1+ϵ1ϵW*F2)1/2dK+1
for d large enough with probability at least 12 exp(t2d), which is 1exp(Cd) for some absolute constant C>0. □

Having established Claim 1, we now return to the proof of Theorem 2. Let

S2{WRm×d:rank(W)<d,L^(W)<12C5σmin(W*)4,WFdK+1}.

A consequence of Claim 1 is that P(S1=S2)1exp(Cd). We now establish the following.

Claim 2.

P(S2=Ø)1(9d2+4K+7)d21·exp(C3Nd44K)NdeCd.

Note that combining Claims 1 and 2 through a union bound yields

infWRm×d:rank(W)<d L^(W)12C5σmin(W*)4,
with probability at least
1exp(Cd)(9d9+4K)d21exp(C3Nd44K)NdeCd,
therefore establishing Theorem 2.

Proof of Claim 2.

Let A=WTWRd×d. We claim AFd2K+2. To see this, note that AF2=trace(ATA)=trace(A2). Let θ1,,θd be the eigenvalues of A, all nonnegative as A0, and θ12,,θd2 are the eigenvalues of A2. With this,

trace(A2)=1idθi2(1idλi)2=trace(A)2.

Hence, AFtrace(A)=WF2d2K+2 as requested.

Next, let

SR={ARd×d:rank(A)d1,A0,AFR};
let S¯ϵ be a ϵnet for Sd2K+2 in the Frobenius norm, where ϵ is to be tuned appropriately later. Using Lemma 3, we have
|S¯ϵ|(9d2K+2ϵ)d21.

Now, applying Lemma 2 with K1=12 and K2=K and taking a union bound across the net S¯ϵ, we arrive at the following conclusion:

P(AS¯ϵ{1N1iN(YiXiTAXi)2<12C5σmin(W*)4}E(A)c from Lemma 2|Xid,1iN)(9d2K+2ϵ)d21exp(C3Nd4+4K),
where
C5=min{μ4(1/2)μ2(1/2)2,2μ2(1/2)2}andμn(K)=E[Xin||Xi|dK].

Now, because P(Xi<d,1iN)1Nd exp(Cd) by Lemma 1, we obtain

P(AS¯ϵ{1N1iN(YiXiTAXi)212C5σmin(W*)4})1(9d2K+2ϵ)d21·exp(C3Nd4+4K)Ndexp(Cd).

In the remainder of the proof, suppose for every AS¯ϵ,

1N1iN(YiXiTAXi)212C5σmin(W*)4,
and Xid1/2,i[N], which collectively hold with probability at least
1(9d2K+2ϵ)d21·exp(C3Nd4+4K)2Ndexp(Cd).

Now, let WRm×d with WFdK+1,rank(W)d1. Let A=WTW (thus, AFd2K+2) and A^S¯ϵ be such that AA^Fϵ. We now estimate

Δ|1N1iN(YiXiTAXi)21N1iN(YiXiTA^Xi)2|.

For notational convenience, let A*=(W*)TW*. Now,

Δ1N1iN|(XiT(AA*)Xi)2(XiT(A^A*)Xi)2|=1N1iN|XiT(AA^)Xi|·|XiT(A+A^2A*)Xi|.

Now, using Cauchy–Schwarz (for inner product M,Ntrace(MTN)),

|XiT(AA^)Xi|=|AA^,XiXiT|AA^F·Xi22,
using XiXiTF=Xi22. In particular, we obtain
|XiT(AA^)Xi|ϵd2.

For the term |XiT(A+A^2A*)Xi|, we observe that the triangle inequality yields

A+A^2A*F4d2K+2.

Thus,

|XiT(A+A^2A*)Xi|4d2K+4.

Using these, we obtain

|L^(W)1N1iN(YiXiTA^Xi)2|4ϵd6+2K=O(d1)=od(1),
taking ϵ=d72K. Using, finally, the fact that
1N1iN(YiXiTA^Xi)2
is bounded away from zero across the net S¯ϵ, we conclude the proof of Claim 2. □

Because it is already noted that Claims 1 and 2 together yield Theorem 2, we complete the proof of Theorem 2.

4.9.1. Case of Constant d: d=O(1).

We now carry out a separate analysis for the case of constant d (d=O(1)). We only point out the necessary modifications, hiding factors depending on the constant d under asymptotic notations.

In what follows, we use the fact that, if X is a sub-Gaussian random variable, then E[|X|p]< for every p1. For a proof, see Vershynin [91, lemma 5.5], which establishes a stronger conclusion that E[|X|p]1/p=O(p) for every p1.

4.9.2. Modifying Claim 1.

Claim 1 modifies to the following: with probability at least 1O(1/N), it holds that, for any WRm×d with L^(W)12C5¯σmin(W*)4,

WF=O(1).

We now sketch the proof of this modified claim. For a matrix MRd×d, denote by Mmax1i,jd|Mij| and let ϵ>0 be arbitrary. Then, we show that, over the randomness in XiRd,1iN with probability at least 1O(1/N),

1N1iNXiXiTσ2Idϵ.

Indeed, fix an ϵ>0. Then, for any 1jd,

P(|1N1iNXi(j)2σ2|ϵ)1O(1/N).

This is by Chebyshev’s inequality (here, Xi=(Xi(j):1jd)Rd). Here, we used, in particular, the fact that E[Xi(j)4]<. Likewise, for any 1j<jd,

P(|1N1iNXi(j)Xi(j)|ϵ)1O(1/N).

Taking a union bound over these d(d+1)/2 events corresponding to the entries of matrix N11iNXiXiT, we are done. (Note that the number of events, d(d+1)/2, is O(1). This, and other factors depending on ϵ are hidden under O(·).) Using the trivial bound MMF2, valid for any matrix MRd×d, we arrive at the operator norm bound

1N1iNXiXiTσ2Idϵ2d2.

Finally, taking ϵ=1/(2d2)=O(1), we finish the proof of modified Claim 1.

4.9.3. Modifying Claim 2.

Claim 2 now modifies to P(S2=Ø)1O(1/N), and this modified version is shown as follows. Note that, for any ϵ=O(1), the size of the net we consider is O(1). We claim that, with probability 1O(1/N), it holds that, for any AS¯ϵ,

1N1iN(YiXiTAXi)223C5¯σmin(W*)4,
where
C5¯=min{μ4μ22,2μ22}andμn=E[X1(1)n].

To show this, fix an AS¯ϵ. Now, instead of Lemma 2, one can apply Chebyshev’s inequality:

1N1iN(XiTAXiXiT(W*)TW*Xi)223E[(XTAXXT(W*)TW*X)2],
with probability at least 1O(1/N). Here, we, in particular, used the fact E[Xi(j)8]<. Because
E[(XTAXXT(W*)TW*X)2]C5¯σmin(W*)4
by Theorem 1, we establish the claim by taking a union bound over the net S¯ϵ, which has O(1) cardinality. The rest of the argument for Claim 2 remains (nearly) intact. In particular, the bound AA^Fϵ remains intact, and A+A^2A*F is now O(1). Finally, keeping in mind that 1N1iNXi24=O(1) with probability 1O(1/N), we complete the proof by taking ϵ small enough.

Putting these together as in the proof of Theorem 2, we complete the proof.

4.10. Proof of Theorem 4

We start by computing L^(W). Taking derivatives with respect to the jth row Wj of WRm×d, we arrive at

WjL^(W)=4N1iN(1jmWj,Xi2Yi)Wj,XiXi.

Interpreting these gradients as a row vector and aggregating into a matrix, we then have

WL^(W)=W(4N1iN(1jmWj,Xi2Yi)XiXiT).

Assume now that rank(W)=d, and L^(W)=0. We then arrive at

1N1iN(1jmWj,Xi2Yi)XiXiT=0.

We now claim that L^(W)=0. To see this, we take a route similar to Soltanolkotabi et al. [83, lemma 6.1]. Let MWTW, and consider the function

f(M)1N1iN(YiXiTMXi)2.

Observe that f(·) is quadratic in M. Thus, for any M^ with f(M^)=0, that is,

1N1iN(XiTM^XiYi)XiXiT=0,
it is the case that M^ is a global optimum of f. In particular, for any MRd×d,f(M)f(M^). Now, take any W¯Rm×d, and observe that L^(W¯)=f(W¯TW¯). Because f(WTW)=0, it follows that
L^(W¯)=f(W¯TW¯)f(WTW)=L^(W).

Namely, W is indeed a global optimizer of L^(·). Because W=W* makes the cost zero, we obtain L^(W)=0.

Now, using Theorem 10, we obtain that span(XiXiT:1iN) is the set of all d × d symmetric matrices with probability one provided Nd(d+1)/2. In this case, using Theorem 9, we conclude that WTW=(W*)TW*, concluding the proof.

4.11. Proof of Theorem 6

4.11.1. Part (a).

Note that, by Claim 1, it follows that, with probability at least 1exp(Cd), it is the case that, for any W with L^(W)L^(W0)<12C5σmin(W*)4,WFdK+1. Now, let

E1{supW:L^(W)L^0WFdK+1};(13)
thus, P(E1)1exp(Cd) and
E2{Xid1/2,1iN},(14)
such that P(E2)1Nd exp(Cd) per Lemma 1.

Note that the 2L^(W)=poly(WF,X1,,XN). Thus, on the event E1E2, which holds with probability at least 1Nd exp(Cd)exp(Cd), we have that

L=sup{2L^(W):L^(W)L^0}=poly(d)<+
as claimed.

4.11.2. Part (b).

Suppose that the event E1E2 (where E1 and E2 are defined, respectively, in (13) and (14)) takes place. We run the gradient descent with a step size of η<1/2L. A second order Taylor expansion reveals that

L^(W1)L^(W0)ηL^(W0)F2/2,
where L^(W) is the gradient of the empirical risk evaluated at W. In particular, L^(W1)L^(W0). Because E1 takes place, we conclude 2L^(W1)L=poly(d), where 2L^(W) is the spectral norm of the Hessian matrix 2L^(W). From here, we induct on k; the induction argument reveals that we can retain a step size of η<1/2L (thus, η=poly(d)), and furthermore, along the trajectory {Wk}k0, it holds that
L^(Wk+1)L^(Wk)ηL^(Wk)F2/2.

Now, let T be the first time for which L^(W)Fζ, namely, the horizon required to arrive at an ζstationary point. In what follows, we carry out our analysis in terms of ζ. At the end, we incorporate the bound (4) on ζ.

We claim T=poly(ζ1,d,σmin(W*)1).

To see this, note that, from the definition of T, it holds that L^(Wt)Fζ as tT1. Now, a telescoping argument together with η=1/poly(d) reveals

L^(WT)L^(W0)T(poly(d))1ζ2.

Using now L^(WT)0, we conclude L^(W0)Tζ2poly(d). Because L^(W0)=L^0 is at most polynomial in d as per (12), we conclude T=poly(ζ1,d).

We now turn our attention to bounding its risk. Let riYiXiTWTWXi. Note that L^(W)=1N1iNri2. Now,

L^(W)=1N1iNri(XiT(W*)TW*XiXiTWTWXi)=WTW(W*)TW*,1N1iNriXiXiT.

Using the Cauchy–Schwarz inequality, we have

L^(W)=|WTW(W*)TW*,1N1iNriXiXiT|WTW(W*)TW*F·1N1iNriXiXiTF.

Next, WTWF2=trace((WTW)2)(trace(WTW))2=WF4, using the fact that WTW0. In particular, on the event E1 defined as per (13), we conclude that WFdK+1, and therefore, WTWFd3. This, together with W*FdK and the triangle inequality, then yields

WTW(W*)TW*F2d2K+2,
with probability at least 1exp(Cd). Hence, on this event,
L^(W)2d2K+21N1iNriXiXiTF.(15)

With this, we now turn our attention to bounding

1N1iNriXiXiTF.

We establish that, for the event

E3{infWRm×d:σmin(W)<12σmin(W*)WFdK+1 L^(W)12C5σmin(W*)4},(16)
it is the case that
P(E3)1(9d4K+9)d21·exp(C4Nd44K)Nd exp(Cd).(17)

This is almost a straightforward modification of the proof of the earlier energy barrier result Theorem 2, and we only point out required modifications. Take any WRm×d with σmin(W)<12σmin(W*). In particular,

λmin(WTW)=σmin(W)2<14σmin(W*)2.

Inspecting now the proof of Theorem 1(a), we obtain that, for such a W,

E[(XTWTWXXT(W*)TW*X)2|Xd1/2]34C5σmin(W*)4,
and consequently, modifying Lemma 2, we have that
P(1N1iN(YiXiTWTWXi)212C5σmin(W*)4|Xid,1iN)1exp(CNd44K).

Using now a covering numbers bound, in the same manner as in the proof of Theorem 2, we conclude that

infWRm×d:σmin(W)<12σmin(W*)WFdK+1 L^(W)12C5σmin(W*)4
with probability at least
1(9d4K+9)d21·exp(C4Nd44K)Nd exp(Cd).

Now, suppose in the remainder of this part that the event E1E2E3, which is

{supW:L^(W)L^0WFdK+1}{Xid1/2,1iN}{infWRm×d:σmin(W)<12σmin(W*)WFdK+1 L^(W)12C5σmin(W*)4},
holds true. In particular, for any W with risk less than 12C5σmin(W*)4, we have σmin(W)>12σmin(W*)>0 (in particular, any such W is invertible). Now, take any ζstationary point W generated by the gradient descent. Because of the event E3 and the fact that L^(W)<L^0, proven earlier, it holds that rank(W)=d, and from the definition of ζstationarity, we have
L^(W)Fζ.

Inspecting the proof of Theorem 4, we observe that

L^(W)=4W(1N1iNriXiXiT).

Thus, we arrive at

W(1N1iNriXiXiT)F4ζ.

Let

BW(1N1iNriXiXiT).

Note now that

1N1iNriXiXiT=(WTW)1WTB.

Next, we have

(WTW)12=1σmin(WTW)=1σmin(W)2<4σmin(W*)2,
because of conditioning on E3 (16). Furthermore,
WT2=W2=λmax(WTW)trace(WTW)=WFdK+1.

We now combine these findings.

1N1iNriXiXiTF=(WTW)1WTBF(WTW)12WTBF(WTW)12WT2BF16ζσmin(W*)2dK+1.

We now use the bounds on P(E1) as per (13), on P(E2) as per (14), and on P(E3) as per (17) to control P(E1E2E3). We conclude by the union bound that, with probability at least

1exp(Cd)(9d4K+9)d21·exp(C4Nd44K)Nd exp(Cd),
it holds that, for any W with L^(W)Fζ, its empirical risk is controlled as per (15):
L^(W)32ζσmin(W*)2d4K+4.(18)

Finally, because

ζϵ32σmin(W*)2d4K4
per (4), we deduce L^(W)ϵ as claimed. The running time is polynomial in ζ1 and d, and therefore, is polynomial in ϵ1,σmin(W*)1 and d. This completes the proof of Part (b).

4.11.3. Part (c).

Let WRm×d be such that L^(W)κ. Define the matrix

MWTW(W*)TW*.

We bound MF, which ensures weights WTW are uniformly close to ground truth weights defined as (W*)TW*. We start by conditioning: assume in the remainder that the event E2 in (14) stating Xid1/2 for every i[N] is true; this holds with probability at least 1Nd exp(Cd) as per Lemma 1.

Note that

L^(W)=1N1iN(XiTMXi)2.

To this end, consider a matrix ΞRN×d(d+1)/2, consisting of i.i.d. rows in which the ith row of Ξ is Ri(Xi(1)2,,Xi(d)2,Xi(k)Xi():1k<d)Rd(d+1)/2. Next, let

Σ=E[RiRiT]Rd(d+1)2×d(d+1)2,
where Ri is the ith row of matrix Ξ. Furthermore, let MRd(d+1)/2 be a vector consisting of entries M11,,Mdd and 2Mij,1i<jd. With this notation, if v=ΞMRN×1, then we have
L^(W)=v22/Nv22Nκ,
because L^(W)κ by assumption.

Next, we have

M=(ΞTΞ)1ΞTvM22(ΞTΞ)122ΞTv22.(19)

We start with the second term. Recall that v2Nκ, and we condition on Xi<d1/2,1iN. Next, using the Cauchy–Schwarz inequality,

|(ΞTv)i|v2NdNd1/2κ.(20)

Hence,

ΞTv22N2d3κ.(21)

We now control (ΞTΞ)122. This is done in a manner similar to the proof of Emschwiller et al. [33, theorem 3.2]. The main tool is the result Theorem 17 for concentration of the spectrum of random matrices with i.i.d. nonisotropic rows. The parameter setting we operate under is provided as follows.

Table

Table

ParameterValue
md 2
tN1/8
δN3/8d
γmax(Σ1/2δ,δ2)

Start by verifying that, because we condition on Xi<d1/2, it is indeed the case that the 2norm of each row of Ξ is at most d; thus, the preceding value of m works.

We now claim γ=Σ1/2δ. To prove this, it suffices to show

N>Σ4/3d83.

Using Emschwiller et al. [33, theorem 5.1] (also see Remark 2) with k = 2, we obtain σmin(Σ)cd4 for some absolute constant c > 0 depending only on the data coordinate distribution. Consequently,

Σ4/3σmin(Σ)4/3c4/3d16/3Σ4/3d83<c4/3d8,
which is below sample size N as requested. Therefore, γ=Σ1/2δ.

We now claim

12σmin(Σ)>γ=Σ1/2N38d.

This is equivalent to establishing

N>28/3Σ4/3d83σmin(Σ)8/3.

Using again Emschwiller et al. [33, theorem 5.1], we have Σ<fd4 for some absolute constant f > 0. This yields

28/3Σ4/3d83σmin(Σ)8/3<Cd563
for some absolute constant C>0, which again holds for our case as N>d18+43.

The rest is verbatim from Emschwiller et al. [33]. We now apply Theorem 17. With probability at least 1d2 exp(cN1/4) (here, c > 0 is an absolute constant), it holds that

1NΞTΞΣγ.(22)

Now, for D=d(d+1)/2,

1NΞTΞΣγvRD,|1NΞv22vTΣv|γv22,
which implies, for every v on the sphere SD1={vSD:v2=1},
1NΞv22vTΣvγ1Ninfv:v=1Ξv22infv:v=1 vTΣvγ.

Now, using the Courant–Fischer variational characterization of the smallest singular value (Horn and Johnson [47]), we obtain

σmin(Ξ)N(σmin(Σ)γ)>N2σmin(Σ),(23)
with probability at least 1exp(cN1/4), where c>0 is a positive absolute constant smaller than c.

We now return to (19) to specifically bound (ΞTΞ)1. Let A be any matrix A. Note that A1=σmin(A)1. Indeed, taking the singular value decomposition A=UΣVT and observing A1=(VT)1Σ1U1, we obtain A1=maxi(σi(A))1=σmin(A)1. This, together with (23), yields

(ΞTΞ)12Nσmin(Σ),(24)
with probability at least 1exp(cN1/4).

We now have all ingredients to execute the bound in (19). Combining Equations (21) and (24), we get

M=(ΞTΞ)1ΞTvM22(ΞTΞ)122·ΞTv224N2σmin(Σ)2from (24)·N2d3κfrom (21) =4κσmin(Σ)2d34Cκd11,
for some constant C > 0. Using (18) from Part (b), we have that κ can be taken
32ζσmin(W*)2d4K+4
with probability at least
1exp(Cd)(9d4K+9)d21·exp(C4Nd44K)Nd exp(Cd).

Because M224Cκd11 with probability at least 1exp(cN1/4), we have that

M2Cζd15/2+2Kσmin(W*)1
with probability at least
1exp(cN1/4)(9d4K+9)d21·exp(C4Nd44K)Nd exp(Cd),
by the union bound. As MFM2 and
ζϵCd15/22Kσmin(W*)
per (4), we arrive at WTW(W*)TW*Fϵ as claimed.

We now show the generalization ability. For any WRm×d, using auxiliary result Theorem 12(c), we have

L(W)μ22·trace(M)+max{μ4μ22,2μ22}·trace(M2),
where M=WTW(W*)TW*Rd×d. Now, note that trace(M)2=|1idMii|2d1idMii2dMF2 by Cauchy–Schwarz. Furthermore, trace(M2)=trace(MTM)=MF2. Thus,
L(W)MF2(dμ22+max{μ4μ22,2μ22})2dμ22MF2,
for d large. Because MF2M22(C)2ζd15+4Kσmin(W*)2, we obtain
L(W)ζ·2(C)2μ22d16+4Kσmin(W*)2.

Finally, because

ζϵ2(C)2μ22d164Kσmin(W*)2
per (4), we conclude the proof of generalization bound, that is, L(W)ϵ.

Remark 2.

The argument presented uses Emschwiller et al. [33, theorem 5.1]. Even though that result is stated for distributions supported on [1,1]d, it still applies under the weaker assumption that the distribution has finite moments of all orders; see Emschwiller et al. [33, remark 5.5].

4.11.4. Case of Constant d: d=O(1).

We provide a very brief sketch for the argument in the case d=O(1). The argument is quite similar to the one in Theorem 2. Similar to the analysis (of d=O(1) case) conducted for Theorem 2, we use the fact that, if X has a sub-Gaussian random variable, then E[|X|p]1/p=O(p) for every p1, and in particular, E[|X|p]< for all p1; see Vershynin [91, lemma 5.5] for a more precise statement.

Next, the upper bound on the energy value now modifies to 12C5¯σmin(W*)4.

4.11.5. Part (a).

Note that part (a) for the case of general d follows from earlier Claim 1. For the case when d is constant, part (a) now follows from modified Claim 1, provided under Theorem 2 for the case d=O(1). That is, for the event

E1{supW:L^(W)12C5¯σmin(W*)4WF=O(1)},
it is the case that P(E1)1O(1/N), where C5¯ is the constant appearing in Theorem 2.

Furthermore,

2L^(W)=Poly(WF,1N1iNXi2D)
for some absolute constant D > 0, and for any constant D > 0,
1N1iNXi2D=O(1)
with probability at least 1O(1/N) (where we use, in particular, the fact E[Xi(j)2D]<).

Combining these, we find that

Lsup{2L^(W):L^(W)12C5¯σmin(W*)4}=O(1)
with probability at least 1O(1/N).

4.11.6. Part (b).

The analysis for time horizon T remains intact. Furthermore, the entire analysis leading to (15) remains (nearly) intact, and this equation now modifies to

L^(W)O(1)·1N1iNriXiXiTF,
because the Frobenius norm terms involved (all of which are polynomials in d) are now O(1). The event E3 appearing in (16) modifies now to
E3{infWRm×d:σmin(W)<12σmin(W*)WFdK+1 L^(W)12C5¯σmin(W*)4},
which holds with probability at least 1O(1/N) (the modifications are exactly the same as those noted in Theorem 2 for the case d=O(1)). The rest of the analysis in Part (b) is exactly the same: combining modified versions of events E1 and E3 (note that there is no need to incorporate the event E2, which for the case of general d, is required for truncation) via a union bound and recalling Equation (4) on ζ, and it follows that, with probability at least 1O(1/N),L^(W)ϵ.

4.11.7. Part (c).

The modification for this part is as follows. First, we do not condition on E2 as earlier. Instead, we apply Chebyshev’s inequality (elaborated subsequently). The entire analysis leading up to (20) remains the same. Note now that ΞTvRd(d+1)/2. For each coordinate (ΞTv)i of this vector, we have, using Chebyshev’s inequality,

|(ΞTv)i|O(N)·v2N·κ·O(1),
with probability 1O(1/N). Taking a union bound over d(d+1)/2=O(1) coordinates yields that, with probability 1O(1/N), this remains true over all 1id(d+1)/2.

Next, to control |(ΞTΞ)122, we do not need a delicate concentration result (such as Theorem 17) as earlier. Instead, we take the following route.

Fix ϵ>0 to be tuned. Recall the notation Ri from earlier, where Ri is the ith row of matrix ΞRN×d(d+1)/2 and recall that Ri,1iN are i.i.d. random vectors. Using the outer product representation of matrix multiplication as before, we have

1NΞTΞ=1N1iNRiRiTRd(d+1)2×d(d+1)2.

Consequently,

E[1NΞTΞ]=E[RiRiT]=ΣRd(d+1)2×d(d+1)2,
where E[·] acts entry-wise.

Because d=O(1), a simple application of Chebyshev’s inequality together with a union bound over Θ(d4) entries (of N1ΞTΞ) yields

max1i,jd(d+1)/2|(1NΞTΞΣ)ij|ϵ
with probability at least 1O(1/N). Using now MMF2 valid for any matrix M, we obtain
1NΞTΞΣϵ2d4
with probability 1O(1/N). Similar to earlier, σmin(Σ)=Ω(1). Furthermore, the analysis starting from (22) and leading to (23) remains intact (with γ replaced with ϵ2d4,ϵ>0 to be tuned). In particular, for ϵ sufficiently small, it is the case that, with probability at least 1O(1/N),
σmin(Ξ)>N2σmin(Σ).

The rest of the analysis remains intact except that the probability bounds are now modified to 1O(1/N). Finally, recalling the bound (4) on ζ, we find that, with probability 1O(1/N),

WTW(W*)TW*FϵandL(W)ϵ.

4.12. Proof of Theorem 8

Let W0TW0=mId, and let {λ1,,λd}=σ((W*)TW*mId). In what follows, recall the quantities from the proof of Theorem 7(b): σ*Var((Wij*)21),χ2x2dω(x), where ω(x) is the semicircle law. Fix now an arbitrary ϵ>0 and a K > 0.

We start by defining several auxiliary events:

E1{1idλi2<4(1+o(1))md2χ2},E2{|1idλi|<σ*mddϵ},E3{σmin(W*)4116m2},E4{Xid1/2,1iN}.

Note that from the proof of Theorem 7(b) that we have P(Ei)1od(1) for i = 1, 2, 3, and from the union bound and sub-Gaussianity of X, P(E4)1N exp(Cd). Thus,

P(1i4Ei)1od(1)N exp(Cd).

In what follows, suppose we condition on the event 1i4Ei. Note that, in this conditional universe, it is still the case that Xi, 1iN are i.i.d. random vectors with centered i.i.d. coordinates. Using now Hölder’s inequality (Theorem 18) with p=1,q=,U=XiXiT, and V=(W*)TW*mId, we arrive at

|XiT((W*)TW*mId)Xi|=|XiXiT,(W*)TW*mId|(W*)TW*mIdtrace(XiXiT)2mdd2,
where we use the fact that trace(XiXiT)=Xi22d2 (recall the conditioning on E4). Using Hoeffding’s inequality, we have
L^(W0)=1N1iN(XiT(W*)TW*XiXiTW0TW0Xi)232L(W0),
with probability at least
1exp(CNd5m1),
where
L(W0)=E[(XT(W*)TW*XXTW0TW0X)2|Xd1/2].

Namely, L(W0) is the population risk in the conditional universe.

Next, in this conditional space, using Theorem 12(c), we arrive at

L(W0)μ2(1/2)2|1idλi|2+max{μ4(1/2)μ2(1/2)2,2μ2(1/2)2}(1idλi2).

Finally, carrying out the same analysis as in the end of the proof of Theorem 7, we deduce

L^(W0)<12C5σmin(W*)4,
provided m>Cd2 for a large enough constant C, namely, provided that the network is sufficiently overparameterized.

4.13. Proof of Theorem 9

  1. Let span(XiXiT:i[N])=S, the set of all d × d symmetric matrices, and let MS be such that for any i, XiTMXi=0. We establish M = 0. Let 1k,d be two fixed indices. To that end, let θi(k,)R be such that i=1Nθi(k,)XiXiT=ekeT+eekT, where the column vectors ek,eRd are, respectively, the kth and th elements of the standard basis for Rd. Such θi(k,) indeed exist because of the spanning property. Observe that 2Mk,=ekTMe+eTMek=tr(ekTMe+eTMek). Now, using the fact that tr(ABC)=tr(BCA)=tr(CAB) for all matrices A, B, C (with matching dimensions), we have

    2Mk,=tr(MeekT+MekeT)=tr(i=1Nθi(k,)MXiXiT)=i=1Nθi(k,)tr(XiTMXi)=0,
    for every k,[d]. Finally, if W is such that L^(W)=0, then XiTMXi=0 for any i, where M=(W*)TW*WTW. Hence, provided that the geometric condition holds, we have M = 0, that is, WTW=(W*)TW*. From here, the final conclusion follows per Theorem 14. Because WTW=(W*)TW*, W clearly has zero generalization error, that is, L(W)=0.

  2. Our goal is to construct a WRm×d with f(W*;Xi)=f(W;Xi) for every i[N], whereas WTW(W*)TW*. Consider the inner product A,B=trace(AB) in the space of all symmetric d × d matrices. Find 0MRd×d, a symmetric matrix, such that Mspan(XiXiT:i[N]), that is, XiTMXi=0 for every i[N]. We can find such M satisfying M2=1. Consider the linear matrix function M(δ)=(W*)TW*+δM. Note that M(δ) is symmetric for every δ. We claim that, under the hypothesis of the theorem, there exists a δ0>0 such that M(δ) is positive semidefinite for every δ[0,δ0] and that there exists WδRm×d with WδTWδ=M(δ) for all δ[0,δ0]. Observe that, because rank(W*)=d, then (W*)TW*Rd×d with rank((W*)TW*)=d. Therefore, the eigenvalues λ1*,,λd* of (W*)TW* are all positive. In particular {λi*:i[d]}[δ1,) with δ1=σmin(W*)2. Now, let μ1(δ),,μd(δ) be the eigenvalues of M(δ). Using Weyl’s inequality (Horn and Johnson [47]), we have |μi(δ)λi*|δM2=δ for every i. In particular, taking δδ1, we deduce, for every i[d], it holds that μi(δ)λi*δ10, that is, {μi(δ):i[d]}[0,). In particular, we also have that M(δ) is symmetric, and thus, it is positive semidefinite. Thus, there exists a Wδ¯Rd×d such that Wδ¯TWδ¯=M(δ). Now, using the same idea as in the proof of Theorem 1(c), we then deduce that, for any m^d, there exists a matrix WδRm^×d such that WδTWδ=Wδ¯TWδ¯=M(δ). In particular, for this Wδ, if f(Wδ,X) is the function computed by the neural network with weight matrix WδRm^×d, then on the training data, (Xi:i[N]),f(Wδ;Xi)=XiTWδTWδXi=XiT(W*)TW*Xi=f(W*;Xi) because XiTMXi=0 for all i[N]. At the same time, WδTWδ(W*)TW*=δM0 because δ0 and M0, and therefore, WδTWδ(W*)TW*.

Finally, to show L(Wδ)>0, we argue as follows. Suppose L(Wδ)=0. Then, by Theorem 13, it follows that ψ(X)=XTAX=0 identically, where A=WδTWδ(W*)TW*. Now, letting ξ1,,ξd be the eigenvectors of A (with corresponding eigenvalues λ1,,λd), we obtain ξiTAξi=λiξiTξi=λiξi22=0. We, namely, obtain λi=0 for every i[d]. Finally, because A is symmetric and, hence, admits a diagonalization of form A=QΛQ with diagonal entries of Λ being zero, we deduce A is identically zero, which contradicts with the fact that A=δM, which is a nonzero matrix.

4.14. Proof of Theorem 10

Recall that S={MRd×d:MT=M}. Note that this space has dimension (d2)+d. For any 1kd, it is easy to see that the matrices ekeT+eekT are linearly independent, and there are precisely (d2)+d such matrices. With this in mind, the statement of part (b) is immediate.

We now prove part (a) of the theorem. For any Xi, let Xi(j) be the jth coordinate of Xi with j[d], and let Yi be a d(d+1)/2dimensional vector obtained by retaining Xi(1)2,,Xi(d)2 and the products Xi(k)Xi() with 1k<d. Now, let X be an n×d(d+1)/2 matrix, whose rows are Y1,,Yn. Our goal is to establish

P[det(X)=0]=0,
when n=d(d+1)/2, where the probability is taken with respect to the randomness in X1,,Xn (in particular, this yields, for nd(d+1)/2,P(rank(X)=d(d+1)/2) almost surely). Now, recalling Theorem 13, it then suffices to show that det(X) is not identically zero when viewed as a polynomial in Xi(j) with i[N],j[d].

We now prove part (b) by providing a deterministic construction (of the matrix X) under which det(X)0. Let p1<<pd be distinct prime numbers. For every 1tN, set

Xt=(p1t1,,pdt1)TRd.

In particular, X1=(1,1,,1)TRd, which then implies that Y1 is a vector of all ones. Now, we study Y2. The entries of Y2, called z1,,zd(d+1)/2, are of form pi2 with i[d] or pipj, where 1i<jd. By the fundamental theorem of arithmetic, we have pipj=pkp{pi,pj}={pk,p}, and therefore, z1,,zd(d+1)/2 are pairwise distinct. With this construction, the matrix X is a Vandermonde matrix with determinant

1k<d(d+1)/2(zkz).

Because zkz for every k (from the construction on Y2, which, in turn, is constructed from X2), this determinant is nonzero, proving the claim.

4.14. Proof of Theorem 11

  1. Note that, if NN*, then, combining parts (a) of Theorems 9 and 10, we have that, with probability one, span(XiXiT:i[N])=S, which, together with L^(W)=0, imply that

    P(EØ)=0,
    where E={WRm×d:WTW(W*)TW*;L^(W)=0} from which the desired conclusion follows.

  2. Assume W is taken as in proof of Theorem 9(b), that is,

    A=(W*)TW*WTW=δMwhereδ=σmin(W*)2andM=1,
    with MT=M. Let {λ1,,λd} be the spectrum of the matrix δM. Using now Theorem 12(c), we have the lower bound
    L(W)E[Xi(j)2]2trace(A)2+min{Var(Xi(j)2),2E[Xi(j)2]2}·trace(A2)min{Var(Xi(j)2),2E[Xi(j)2]2}(i=1dλi2)min{Var(Xi(j)2),2E[Xi(j)2]2}λmax(δM)2,
    because trace(A2)=i=1dλi2. Finally, because λmax(δM)2=δ2=σmin(W*)4 (as the spectral norm of M is one), we arrive at the desired conclusion.

Acknowledgments

The authors thank the anonymous reviewers for their very detailed feedback, which improved the presentation of this paper, and Orestis Plevrakis for providing useful remarks on the initial version of this paper. Part of this work was done when D. Gamarnik and E. C. Kızıldağ were visiting the Simons Institute for the Theory of Computing at the University of California, Berkeley in Fall 2020.

References

  • [1] Arora S, Ge R, Neyshabur B, Zhang Y (2018) Stronger generalization bounds for deep nets via a compression approach. Preprint, submitted February 14, https://arxiv.org/abs/1802.05296.Google Scholar
  • [2] Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted January 24, https://arxiv.org/abs/1901.08584.Google Scholar
  • [3] Bai ZD, Yin YQ (1988) Convergence to the semicircle law. Ann. Probab. 16(2):863–875.CrossrefGoogle Scholar
  • [4] Bai ZD, Yin YQ (1993) Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3):1275–1294.CrossrefGoogle Scholar
  • [5] Barron AR (1994) Approximation and estimation bounds for artificial neural networks. Machine Learn. 14(1):115–133.CrossrefGoogle Scholar
  • [6] Bartlett PL, Foster DJ, Telgarsky MJ (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6240–6249.Google Scholar
  • [7] Bartlett PL, Harvey N, Liaw C, Mehrabian A (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.Google Scholar
  • [8] Bhatia R (2013) Matrix Analysis, vol. 169 (Springer Science & Business Media, Berlin, Heidelberg).Google Scholar
  • [9] Blum A, Rivest RL (1989) Training a 3-node neural network is NP-complete. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 494–501.Google Scholar
  • [10] Bölcskei H, Grohs P, Kutyniok G, Petersen P (2019) Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1):8–45.Google Scholar
  • [11] Brutzkus A, Globerson A (2017) Globally optimal gradient descent for a convnet with gaussian inputs. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 605–614.Google Scholar
  • [12] Brutzkus A, Globerson A, Malach E, Shalev-Shwartz S (2017) SGD learns over-parameterized networks that provably generalize on linearly separable data. Preprint, submitted October 27, https://arxiv.org/abs/1710.10174.Google Scholar
  • [13] Candès EJ, Li X (2014) Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations Comput. Math. 14:1017–1026.CrossrefGoogle Scholar
  • [14] Candès EJ, Plan Y (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57(4):2342–2359.CrossrefGoogle Scholar
  • [15] Candès EJ, Strohmer T, Voroninski V (2013) Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66(8):1241–1274.CrossrefGoogle Scholar
  • [16] Caron R, Traynor T (2005) The zero set of a polynomial. WSMR Report 05-02, University of Windsor, Windsor, Canada.Google Scholar
  • [17] Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK (2018) Neural ordinary differential equations. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6571–6583.Google Scholar
  • [18] Chizat L, Bach F (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 3036–3046.Google Scholar
  • [19] Choromanska A, Henaff M, Mathieu M, Ben Arous G, LeCun Y (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist. (PMLR, New York), 192–204.Google Scholar
  • [20] Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proc. 25th Internat. Conf. Machine Learn. (ACM, New York), 160–167.Google Scholar
  • [21] De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9):1342–1350.CrossrefGoogle Scholar
  • [22] Demanet L, Hand P (2014) Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl. 20:199–221.CrossrefGoogle Scholar
  • [23] Deng Y, Li Z, Song Z (2023) An improved sample complexity for rank-1 matrix sensing. Preprint, submitted March 13, https://arxiv.org/abs/2303.06895.Google Scholar
  • [24] Du SS, Lee JD (2018) On the power of over-parametrization in neural networks with quadratic activation, Preprint, submitted March 3, https://arxiv.org/abs/1803.01206.Google Scholar
  • [25] Du SS, Lee JD, Tian Y (2017) When is a convolutional filter easy to learn? Preprint, submitted September 18, https://arxiv.org/abs/1709.06129.Google Scholar
  • [26] Du SS, Zhai X, Poczos B, Singh A (2018) Gradient descent provably optimizes over-parameterized neural networks. Preprint, submitted October 4, https://arxiv.org/abs/1810.02054.Google Scholar
  • [27] Du SS, Lee JD, Li H, Wang L, Zhai X (2018) Gradient descent finds global minima of deep neural networks. Preprint, submitted November 9, https://arxiv.org/abs/1811.03804.Google Scholar
  • [28] Du SS, Lee JD, Tian Y, Poczos B, Singh A (2017) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Preprint, submitted December 3, https://arxiv.org/abs/1712.00779.Google Scholar
  • [29] Du SS, Jin C, Lee JD, Jordan MI, Singh A, Poczos B (2017) Gradient descent can take exponential time to escape saddle points. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1067–1077.Google Scholar
  • [30] Dziugaite GK, Roy DM (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint, submitted March 31, https://arxiv.org/abs/1703.11008.Google Scholar
  • [31] Eldan R, Shamir O (2016) The power of depth for feedforward neural networks. Conf. Learn. Theory (PMLR, New York), 907–940.Google Scholar
  • [32] Eldar YC, Mendelson S (2014) Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmonic Anal. 36(3):473–494.Google Scholar
  • [33] Emschwiller M, Gamarnik D, Kızıldağ EC, Zadik I (2020) Neural networks and polynomial regression. Demystifying the overparametrization phenomena. Preprint, submitted March 23, https://arxiv.org/abs/2003.10523.Google Scholar
  • [34] Freeman CD, Bruna J (2016) Topology and geometry of half-rectified network optimization. Preprint, submitted November 4, https://arxiv.org/abs/1611.01540.Google Scholar
  • [35] Fulton W (2000) Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bull. Amer. Math. Soc. 37(3):209–249.CrossrefGoogle Scholar
  • [36] Ge R, Lee JD, Ma T (2017) Learning one-hidden-layer neural networks with landscape design. Preprint, submitted November 1, https://arxiv.org/abs/1711.00501.Google Scholar
  • [37] Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points—Online stochastic gradient for tensor decomposition. Conf. Learn. Theory (PMLR, New York), 797–842.Google Scholar
  • [38] Goel S, Kanade V, Klivans A, Thaler J (2016) Reliably learning the ReLU in polynomial time. Preprint, submitted November 30, https://arxiv.org/abs/1611.10258.Google Scholar
  • [39] Golowich N, Rakhlin A, Shamir O (2017) Size-independent sample complexity of neural networks. Preprint, submitted December 18, https://arxiv.org/abs/1712.06541.Google Scholar
  • [40] Gonon L, Grigoryeva L, Ortega J-P (2020) Approximation bounds for random neural networks and reservoir systems. Preprint, submitted February 14, https://arxiv.org/abs/2002.05933.Google Scholar
  • [41] Gunasekar S, Woodworth BE, Bhojanapalli S, Neyshabur B, Srebro N (2017) Implicit regularization in matrix factorization. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 30.Google Scholar
  • [42] Haeffele BD, Vidal R (2015) Global optimality in tensor factorization, deep learning, and beyond. Preprint, submitted June 24, https://arxiv.org/abs/1506.07540.Google Scholar
  • [43] Haeffele B, Young E, Vidal R (2014) Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. Internat. Conf. Machine Learn. (PMLR, New York), 2007–2015.Google Scholar
  • [44] Hardt M, Ma T (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
  • [45] Harvey N, Liaw C, Mehrabian A (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. Conf. Learn. Theory (PMLR, New York), 1064–1068.Google Scholar
  • [46] He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 770–778.Google Scholar
  • [47] Horn RA, Johnson CR (2012) Matrix Analysis (Cambridge University Press, Cambridge, MA).CrossrefGoogle Scholar
  • [48] Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: Theory and applications. Neurocomputing 70(1–3):489–501.CrossrefGoogle Scholar
  • [49] Janzamin M, Sedghi H, Anandkumar A (2015) Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. Preprint, submitted June 28, https://arxiv.org/abs/1506.08473.Google Scholar
  • [50] Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1724–1732.Google Scholar
  • [51] Kawaguchi K (2016) Deep learning without poor local minima. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 586–594.Google Scholar
  • [52] Khodak M, Tenenholtz N, Mackey L, Fusi N (2021) Initialization and regularization of factorized neural layers. Preprint, submitted May 3, https://arxiv.org/abs/2105.01029.Google Scholar
  • [53] Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1097–1105.Google Scholar
  • [54] Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Conf. Learn. Theory (PMLR, New York), 1246–1257.Google Scholar
  • [55] Levy KY (2016) The power of normalization: Faster evasion of saddle points. Preprint, submitted November 15, https://arxiv.org/abs/1611.04831.Google Scholar
  • [56] Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 597–607.Google Scholar
  • [57] Li Y, Ma T, Zhang H (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Conf. Learn. Theory (PMLR, New York), 2–47.Google Scholar
  • [58] Liang T, Poggio T, Rakhlin A, Stokes J (2017) Fisher-Rao metric, geometry, and complexity of neural networks. Preprint, submitted November 5, https://arxiv.org/abs/1711.01530.Google Scholar
  • [59] Livni R, Shalev-Shwartz S, Shamir O (2014) On the computational efficiency of training neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 855–863.Google Scholar
  • [60] Mannelli SS, Vanden-Eijnden E, Zdeborová L (2020) Optimization and generalization of shallow neural networks with quadratic activation functions. Preprint, submitted June 27, https://arxiv.org/abs/2006.15459.Google Scholar
  • [61] Mendelson S (2017) Extending the scope of the small-ball method. Preprint, submitted September 4, https://arxiv.org/abs/1709.00843.Google Scholar
  • [62] Mhaskar H, Liao Q, Poggio T (2016) Learning functions: When is deep better than shallow. Preprint, submitted March 3, https://arxiv.org/abs/1603.00988.Google Scholar
  • [63] Mohamed AR, Dahl GE, Hinton G (2011) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Language Processing 20(1):14–22.CrossrefGoogle Scholar
  • [64] Neyshabur B, Bhojanapalli S, Srebro N (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint, submitted July 29, https://arxiv.org/abs/1707.09564.Google Scholar
  • [65] Neyshabur B, Tomioka R, Srebro N (2015) Norm-based capacity control in neural networks. Conf. Learn. Theory (PMLR, New York), 1376–1401.Google Scholar
  • [66] Neyshabur B, Bhojanapalli S, McAllester D, Srebro N (2017) Exploring generalization in deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 5947–5956.Google Scholar
  • [67] Nguyen Q, Hein M (2017) The loss surface of deep and wide neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 2603–2612.Google Scholar
  • [68] Nguyen Q, Hein M (2018) The loss surface and expressivity of deep convolutional neural networks (OpenReview.net).Google Scholar
  • [69] Pennington J, Worah P (2017) Nonlinear random matrix theory for deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2637–2646.Google Scholar
  • [70] Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Internat. J. Automation Comput. 14(5):503–519.CrossrefGoogle Scholar
  • [71] Poston T, Lee CN, Choie Y, Kwon Y (1991) Local minima and back propagation. IJCNN-91-Seattle Internat. Joint Conf. Neural Networks, vol. 2 (IEEE, Piscataway, NJ), 173–176.Google Scholar
  • [72] Qin L, Song Z, Zhang R (2023) A general algorithm for solving rank-one matrix sensing. Preprint, submitted March 22, https://arxiv.org/abs/2303.12298.Google Scholar
  • [73] Rahimi A, Recht B (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1313–1320.Google Scholar
  • [74] Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501.CrossrefGoogle Scholar
  • [75] Rotskoff GM, Vanden-Eijnden E (2018) Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint, submitted May 2, https://arxiv.org/abs/1805.00915.Google Scholar
  • [76] Rudin W (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
  • [77] Safran I, Shamir O (2017) Spurious local minima are common in two-layer ReLU neural networks. Preprint, submitted December 24, https://arxiv.org/abs/1712.08968.Google Scholar
  • [78] Schmidt-Hieber J (2017) Nonparametric regression using deep neural networks with relu activation function. Preprint, submitted August 22, https://arxiv.org/abs/1708.06633.Google Scholar
  • [79] Sedghi H, Anandkumar A (2014) Provable methods for training neural networks with sparse connectivity. Preprint, submitted December 8, https://arxiv.org/abs/1412.2693.Google Scholar
  • [80] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • [81] Sirignano J, Spiliopoulos K (2020) Mean field analysis of neural networks: A central limit theorem. Stochastic Processes Appl. 130(3):1820–1852.CrossrefGoogle Scholar
  • [82] Soltanolkotabi M (2017) Learning ReLUs via gradient descent. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2007–2017.Google Scholar
  • [83] Soltanolkotabi M, Javanmard A, Lee JD (2018) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inform. Theory 65(2):742–769.Google Scholar
  • [84] Song M, Montanari A, Nguyen P (2018) A mean field view of the landscape of two-layers neural networks. Proc. Natl. Acad. Sci. USA 115(33):E7665–E7671.Google Scholar
  • [85] Soudry D, Carmon Y (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks. Preprint, submitted May 26, https://arxiv.org/abs/1605.08361.Google Scholar
  • [86] Soudry D, Hoffer E (2017) Exponentially vanishing sub-optimal local minima in multilayer neural networks. Preprint, submitted February 19, https://arxiv.org/abs/1702.05777.Google Scholar
  • [87] Stöger D, Soltanolkotabi M (2021) Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), vol. 34, 23831–23843.Google Scholar
  • [88] Telgarsky M (2016) Benefits of depth in neural networks. Preprint, submitted February 14, https://arxiv.org/abs/1602.04485.Google Scholar
  • [89] Tian Y (2017) An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3404–3413.Google Scholar
  • [90] Venturi L, Bandeira AS, Bruna J (2019) Spurious valleys in one-hidden-layer neural network optimization landscapes. J. Machine Learn. Res. 20(133):1–34.Google Scholar
  • [91] Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
  • [92] Vodrahalli K, Shivanna R, Sathiamoorthy M, Jain S, Chi EH (2022) Nonlinear initialization methods for low-rank neural networks. Preprint, submitted February 2, https://arxiv.org/abs/2202.00834.Google Scholar
  • [93] Wei C, Lee JD, Liu Q, Ma T (2018) On the margin theory of feedforward neural networks. Preprint, submitted October 12, https://arxiv.org/abs/1810.05369.Google Scholar
  • [94] Weinan E, Han J, Jentzen A (2017) Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Comm. Math. Statist. 5(4):349–380.CrossrefGoogle Scholar
  • [95] Wu L, Zhu Z, W E (2017) Toward understanding generalization of deep learning: Perspective of loss landscapes. Preprint, submitted June 30, https://arxiv.org/abs/1706.10239.Google Scholar
  • [96] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2016) Understanding deep learning requires rethinking generalization. Preprint, submitted November 10, https://arxiv.org/abs/1611.03530.Google Scholar
  • [97] Zhong K, Jain P, Dhillon IS (2015) Efficient matrix sensing using rank-1 gaussian measurements. Algorithmic Learn. Theory: 26th Internat. Conf. Proc. (Springer, Berlin, Heidelberg), 3–18.Google Scholar
  • [98] Zhong K, Song Z, Dhillon IS (2017) Learning non-overlapping convolutional neural networks with multiple kernels. Preprint, submitted November 8, https://arxiv.org/abs/1711.03440.Google Scholar
  • [99] Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS (2017) Recovery guarantees for one-hidden-layer neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 4140–4149.Google Scholar
  • [100] Zhou P, Feng J (2017) The landscape of deep learning algorithms. Preprint, submitted May 19, https://arxiv.org/abs/1705.07038.Google Scholar
  • [101] Zhou Y, Liang Y (2017) Critical points of neural networks: Analytical forms and landscape properties. Preprint, submitted October 30, https://arxiv.org/abs/1710.11205.Google Scholar